
Incremental-DETR: Incremental Few-Shot Object Detection via Self-Supervised Learning

Na Dong1,2*, Yongqiang Zhang2, Mingli Ding2, Gim Hee Lee1
1 Department of Computer Science, National University of Singapore
2 School of Instrument Science and Engineering, Harbin Institute of Technology
{dongna1994, zhangyongqiang, dingml}@hit.edu.cn, [email protected]

arXiv:2205.04042v3 [cs.CV] 27 Feb 2023

* Work fully done while the first author was a visiting PhD student at the National University of Singapore.
Copyright © 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Incremental few-shot object detection aims at detecting novel classes without forgetting knowledge of the base classes, given only a few labeled training samples from the novel classes. Most related prior works are on incremental object detection and rely on the availability of abundant training samples per novel class, which substantially limits scalability to real-world settings where novel data can be scarce. In this paper, we propose Incremental-DETR, which performs incremental few-shot object detection via fine-tuning and self-supervised learning on the DETR object detector. To alleviate severe over-fitting with few novel class data, we first fine-tune the class-specific components of DETR with self-supervision from additional object proposals generated using Selective Search as pseudo labels. We further introduce an incremental few-shot fine-tuning strategy with knowledge distillation on the class-specific components of DETR to encourage the network to detect the novel classes without forgetting the base classes. Extensive experiments conducted on standard incremental object detection and incremental few-shot object detection settings show that our approach significantly outperforms state-of-the-art methods by a large margin. Our source code is available at https://github.com/dongnana777/Incremental-DETR.

Introduction

In the past decade, many impressive general object detectors have been developed due to the huge success of deep learning (Girshick et al. 2014b; Uijlings et al. 2013; Girshick 2015; Lin et al. 2017a; Redmon et al. 2016; Liu et al. 2016; Lin et al. 2017b; Carion et al. 2020; Zhu et al. 2020). However, most deep learning-based object detectors can only detect objects from a fixed set of base classes that are seen during training. Extending the object detector to additional unseen novel classes without losing performance on the base classes requires further training on large amounts of data from both the novel and base classes. Naive fine-tuning with training data only from the novel classes can lead to the notorious catastrophic forgetting problem, where knowledge of the base classes is quickly forgotten when the training data from the base classes are no longer available (McCloskey and Cohen 1989; Ratcliff 1990). Additionally, these deep learning-based detectors also suffer from severe over-fitting when the large amounts of annotated training data, which are costly and tedious to obtain, become scarce. In contrast, humans are much better than machines at continually learning novel concepts without forgetting previously learned knowledge, despite the absence of previous examples and the availability of only a few novel examples. This gap between humans and machine learning algorithms fuels the interest in incremental few-shot object detection, which aims at continually extending the model to novel classes without forgetting the base classes, using only a few samples per novel class.

In this paper, we focus on solving the problem of incremental few-shot object detection. To this end, we propose Incremental-DETR, which performs incremental few-shot object detection via fine-tuning and self-supervised learning on the recently proposed DETR object detector (Zhu et al. 2020). We are inspired by the fine-tuning technique commonly used in few-shot object detectors (Wang et al. 2020; Wu et al. 2020; Sun et al. 2021) based on the two-stage Faster R-CNN framework with a class-agnostic feature extractor and Region Proposal Network (RPN). In the first stage, the whole network is trained on abundant base data. In the second stage, the class-agnostic feature extractor and RPN are frozen, and only the prediction heads are fine-tuned on a balanced subset that consists of both base and novel classes. However, many few-shot object detectors focus on detecting the novel classes but fail to preserve the performance on the base classes. In contrast, we consider the more challenging and practical incremental few-shot object detection, which requires detecting both the novel and base classes. Furthermore, the data of the base classes are no longer accessible when the novel classes are introduced in incremental few-shot object detection, whereas in few-shot object detection the base class data remain accessible when training the novel model, which makes it easier for the novel model to maintain performance on the base classes.

The state-of-the-art DETR object detector consists of a CNN backbone that extracts features from input images, a projection layer that reduces the channel dimension of the features, an encoder-decoder transformer that transforms the features into features of a set of object queries, a 3-layer feed-forward network that acts as the regression head, and a linear projection that acts as the classification head. We unfreeze different layers of DETR for fine-tuning, and empirically identify that the projection layer and classification head are class-specific, while the CNN backbone, transformer and regression head of DETR are class-agnostic. The key part of our method is to separate the training of the class-agnostic and class-specific components of DETR into two stages: 1) base model pre-training and self-supervised fine-tuning, and 2) incremental few-shot fine-tuning. Specifically, the whole network is pre-trained on abundant data from the base classes in the first part of the first stage. In the next part of the first stage, we propose a self-supervised learning method to fine-tune the class-specific projection layer and classification head together with the available abundant base class data. In the second stage, the class-agnostic CNN backbone, transformer and regression head are kept frozen, and we fine-tune the class-specific projection layer and classification head on a few examples of only the novel classes. Catastrophic forgetting is mitigated by identifying and freezing the class-agnostic components in both stages. Our fine-tuning of the class-specific components in both stages alleviates the over-fitting problem while giving the network the ability to detect novel class objects.

Intuitively, humans can easily observe and extract additional meaningful object proposals in the background of the base class images from their prior knowledge. Based on this intuition, we leverage the selective search algorithm (Uijlings et al. 2013) to generate additional object proposals that may exist in the background as pseudo ground truths to fine-tune the network in the first stage to detect class-agnostic region proposals. Specifically, we do multi-task learning only on the class-specific projection layer and classification head, where the generated pseudo ground truths are used to self-supervise the model alongside the full supervision from the abundant base class data. This helps the model learn a more generalizable and transferable projection layer of DETR for incremental few-shot object detection and generalize to novel classes without forgetting the base classes. Furthermore, we propose a classification distillation loss and a masked feature distillation loss in addition to the existing loss of DETR for our incremental few-shot fine-tuning stage to impede catastrophic forgetting. Our main contributions are as follows:
• We propose Incremental-DETR to tackle the challenging and relatively under-explored incremental few-shot object detection problem.
• We identify with empirical experimental studies that the projection layer and the classification head of the DETR architecture are class-specific. We thus adapt the fine-tuning strategy into our two-stage framework to avoid the catastrophic forgetting and over-fitting problems.
• We design a self-supervised method for the model to learn a more generalizable and transferable projection layer of DETR. Consequently, the model can efficiently adapt to the novel classes without forgetting the base classes in spite of novel class data scarcity.
• Extensive experiments conducted on two standard object detection datasets (i.e. MS COCO and PASCAL VOC) demonstrate the significant performance improvement of our approach over existing state-of-the-art methods.

Related Works

Object Detection. Existing deep object detection models generally fall into two categories: 1) two-stage and 2) one-stage detectors. Two-stage detectors such as R-CNN (Girshick et al. 2014b) apply a deep neural network to extract features from proposals generated by the selective search algorithm (Uijlings et al. 2013). Fast R-CNN (Girshick 2015) utilizes a differentiable RoI Pooling to improve speed and performance. Faster R-CNN (Ren et al. 2015) introduces the Region Proposal Network (RPN) to generate proposals. FPN (Lin et al. 2017a) builds a top-down architecture with lateral connections to extract features across multiple layers. In contrast, one-stage detectors such as YOLO (Redmon et al. 2016) directly perform object classification and bounding box regression on the features. SSD (Liu et al. 2016) uses a feature pyramid with different anchor sizes to cover the possible object scales. RetinaNet (Lin et al. 2017b) proposes the focal loss to mitigate the imbalance between positive and negative examples. Recently, another type of object detection method (Carion et al. 2020; Zhu et al. 2020) beyond the one-stage and two-stage paradigms has gained popularity. These methods directly supervise bounding box predictions end-to-end with Hungarian bipartite matching. However, they require large amounts of training images per class and many training epochs, and thus suffer from catastrophic forgetting of base classes and over-fitting in the context of incremental few-shot learning. It is therefore imperative to extend the capability of these detectors to novel categories with only a few samples and no access to the original base training data.

Few-Shot Object Detection. Several earlier works in few-shot object detection utilize the meta-learning strategy. MetaDet (Wang, Ramanan, and Hebert 2019), MetaYOLO (Kang et al. 2019), Meta R-CNN (Wu, Sahoo, and Hoi 2020) and Meta-DETR (Zhang et al. 2021) use a meta-learner to generate a prototype per category from the support data and aggregate these prototypes with the query features by channel-wise multiplication. In contrast to the meta-learning based strategy, two-stage fine-tuning based methods show more potential in improving the performance of few-shot object detection. For example, TFA (Wang et al. 2020) first trains Faster R-CNN on the base classes and then only fine-tunes the predictor heads. MPSR (Wu et al. 2020) mitigates the scale scarcity in the few-shot datasets. FSCE (Sun et al. 2021) resorts to contrastive learning to learn discriminative object proposal representations. Our proposed Incremental-DETR falls under the category of two-stage fine-tuning strategies, but differs from existing approaches that build upon Faster R-CNN for few-shot detection. Specifically, we consider the more challenging and practical incremental few-shot object detection by incorporating the two-stage fine-tuning strategy into the recently proposed DETR framework. Incremental few-shot object detection aims at learning a model of novel classes without forgetting the base classes, with only a few samples per novel class. In contrast, many few-shot object detection works focus on detecting the novel classes but fail to preserve the performance on the base classes. Furthermore, the training data of the base classes are no longer
Figure 1: Overview of our proposed base model training stage. Parameters of the modules shaded in green are frozen during
training. Refer to the text for more details.

accessible when the novel classes are introduced in incremental few-shot object detection. However, the training data of the base classes are still accessible when training the novel model in few-shot object detection, which makes it easier for the novel model to maintain performance on the base classes.

Incremental Few-Shot Object Detection. Incremental few-shot object detection is first explored in ONCE (Perez-Rua et al. 2020), which uses CenterNet (Zhou, Wang, and Krähenbühl 2019) as a backbone to learn a class-agnostic feature extractor and a per-class code generator network for the novel classes. CenterNet is well-suited for incremental few-shot object detection since a separate heatmap that makes independent detections by activation thresholding is maintained for each individual object class. MTFA (Ganea, Boom, and Poppe 2021) extends TFA (Wang et al. 2020) in a similar way as Mask R-CNN extends Faster R-CNN. MTFA is then adapted into incremental MTFA (iMTFA) by learning discriminative embeddings for object instances that are merged into class representatives. In this paper, we do incremental few-shot object detection on the DETR object detector. However, DETR does not have a separate heatmap for each class and cannot be easily decoupled into independent detections for each class. Furthermore, DETR is a deeper network, and thus significantly more training data and computation time are needed for convergence. Since we only have access to a few samples of the novel classes in the incremental few-shot setting and the data of the base classes are no longer accessible, DETR more easily forgets the base classes catastrophically due to the absence of the base data, and overfits to the novel classes due to the novel data scarcity.

Problem Definition

Let (x, y) ∈ D denote a dataset D which contains images x and their corresponding ground truth sets of objects y. We further denote the training datasets of the base classes and the novel classes as D_base and D_novel, respectively. Following the definition of class-incremental learning, we only have access to the novel class data D_novel, where y_novel ∈ {C_{B+1}, ..., C_{B+N}}. The base class data D_base, where y_base ∈ {C_1, ..., C_B}, are no longer accessible when the novel classes are introduced. D_base and D_novel have no overlapping classes, i.e. {C_1, ..., C_B} ∩ {C_{B+1}, ..., C_{B+N}} = ∅. The training dataset D_base has abundant annotated instances of {C_1, ..., C_B}. In contrast, D_novel has very few annotated instances of {C_{B+1}, ..., C_{B+N}} and is often described as an N-way K-shot training set, where there are N novel classes and each novel class has K annotated object instances. The goal of incremental few-shot object detection is to continually learn the novel classes {C_{B+1}, ..., C_{B+N}} from only a few training examples without forgetting knowledge of the base classes {C_1, ..., C_B}.

Our Methodology

Base Model Training

As shown in Figure 1, we split the first stage of our Incremental-DETR into base model pre-training and fine-tuning. In base model pre-training, we train the whole model only on the base classes {C_1, ..., C_B} with the same loss functions used in DETR. Following DETR, we denote the set of M predictions for the base classes as ŷ = {ŷ_i}_{i=1}^M = {(ĉ_i, b̂_i)}_{i=1}^M, and the ground truth set of objects as y = {y_i}_{i=1}^M = {(c_i, b_i)}_{i=1}^M, padded with ∅ (no object). For each element i of the ground truth set, c_i is the target class label (which may be ∅) and b_i ∈ [0, 1]^4 is a 4-vector that defines the ground truth bounding box center coordinates, and its height and width relative to the image size.
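To make the box parameterization concrete, the snippet below is a minimal PyTorch-style sketch (not from the paper) that converts absolute COCO-style (x, y, w, h) annotations into the normalized center format used by DETR; the function name and tensor layout are our own assumptions.

```python
import torch

def normalize_boxes(boxes_xywh: torch.Tensor, img_w: float, img_h: float) -> torch.Tensor:
    """Convert absolute (x, y, w, h) boxes to DETR-style normalized
    (cx, cy, w, h) in [0, 1]^4, i.e. the b_i of the problem definition."""
    x, y, w, h = boxes_xywh.unbind(-1)
    cx = (x + 0.5 * w) / img_w   # box center, relative to image width
    cy = (y + 0.5 * h) / img_h   # box center, relative to image height
    return torch.stack([cx, cy, w / img_w, h / img_h], dim=-1)
```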

We adopt a pair-wise matching cost L_match(y_i, ŷ_σ(i)) between the ground truth y_i and a prediction ŷ_σ(i) with index σ(i) to search for a bipartite matching with the lowest cost:

\hat{\sigma} = \arg\min_{\sigma} \sum_{i=1}^{M} \mathcal{L}_{\text{match}}(y_i, \hat{y}_{\sigma(i)}).  (1)

The matching cost takes into account both the class prediction ĉ_σ(i) and the similarity of the predicted box b̂_σ(i) to the ground truth box b_i. Specifically, it is defined as:

\mathcal{L}_{\text{match}}(y_i, \hat{y}_{\sigma(i)}) = \mathbb{1}_{\{c_i \neq \varnothing\}} \mathcal{L}_{\text{cls}}(c_i, \hat{c}_{\sigma(i)}) + \mathbb{1}_{\{c_i \neq \varnothing\}} \mathcal{L}_{\text{box}}(b_i, \hat{b}_{\sigma(i)}).  (2)

Given the above definitions, the Hungarian loss (Kuhn 1955) for all pairs matched in the previous step is defined as:

\mathcal{L}_{\text{hg}}(y, \hat{y}) = \sum_{i=1}^{M} \big[ \mathcal{L}_{\text{cls}}(c_i, \hat{c}_{\hat{\sigma}(i)}) + \mathbb{1}_{\{c_i \neq \varnothing\}} \mathcal{L}_{\text{box}}(b_i, \hat{b}_{\hat{\sigma}(i)}) \big],  (3)

where σ̂ denotes the optimal assignment between predictions and targets. L_cls is the sigmoid focal loss (Lin et al. 2017b). L_box is a linear combination of the ℓ1 loss and the generalized IoU loss (Rezatofighi et al. 2019) with the same weight hyperparameters as DETR.

To make the parameters of the class-specific components of DETR generalize well to the novel classes with a few samples in the incremental few-shot fine-tuning stage, we propose to fine-tune the class-specific components in a self-supervised way while keeping the class-agnostic components frozen during the base model fine-tuning of the first stage, where the parameters of the model are initialized from the pre-trained base model. The base model fine-tuning relies on making predictions on the other potential class-agnostic objects of the base data along with the ground truth objects of the base classes. Inspired by R-CNN (Girshick et al. 2014a) and Fast R-CNN (Girshick 2015), we use the selective search algorithm (Uijlings et al. 2013) to generate a set of class-agnostic object proposals for each of the raw images. Selective search is a well-established and very effective unsupervised object proposal generation algorithm which uses color similarity, texture similarity, size of region and fit between regions to generate object proposals. However, the number of object proposals is large and the ranking is not precise. To circumvent this problem, we use the object ranking from the selective search algorithm to prune away imprecise object proposals. Specifically, we select the top O objects in the ranking list that also do not overlap with the ground truth objects of the base classes as the pseudo ground truth object proposals. We represent the pseudo ground truth object proposals as b′. To make predictions on these selected object proposals, we follow the prediction head of DETR and introduce a new class label c′ for all the selected object proposals as the pseudo ground truth label along with the labels of the base classes. Specifically, c′ depends on the number of categories: if there are n categories, we set c′ to n + 1 alongside the ground truth labels (1, ..., n).

Let us now denote the pseudo ground truth set of the selected object proposals as y′ = {y′_i}_{i=1}^P = {(c′_i, b′_i)}_{i=1}^P, padded with ∅ (no object), and the set of P predictions as ŷ = {ŷ_i}_{i=1}^P = {(ĉ_i, b̂_i)}_{i=1}^P. The same pair-wise matching cost L_match(y′_i, ŷ_σ′(i)) between a pseudo ground truth y′_i and a prediction ŷ_σ′(i) with index σ′(i) is also adopted to search for a bipartite matching with the lowest cost. Finally, the Hungarian loss for all the matched pairs is defined as:

\mathcal{L}_{\text{hg}}(y', \hat{y}) = \sum_{i=1}^{P} \big[ \mathcal{L}_{\text{cls}}(c'_i, \hat{c}_{\hat{\sigma}'(i)}) + \mathbb{1}_{\{c'_i \neq \varnothing\}} \mathcal{L}_{\text{box}}(b'_i, \hat{b}_{\hat{\sigma}'(i)}) \big],  (4)

where σ̂′ denotes the optimal assignment between the predictions and the pseudo ground truths. L_cls is the sigmoid focal loss, which does binary classification between the selected object proposals and the background. L_box is a linear combination of the ℓ1 loss and the generalized IoU loss with the same weight hyperparameters as DETR.

The overall loss L_total^base to fine-tune the base model on the abundant base data D_base is given by:

\mathcal{L}_{\text{total}}^{\text{base}} = \mathcal{L}_{\text{hg}}(y, \hat{y}) + \lambda' \mathcal{L}_{\text{hg}}(y', \hat{y}),  (5)

where λ′ is the hyperparameter that balances the loss terms.
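The following is a rough sketch of the pseudo-proposal generation described above, using OpenCV's contrib implementation of selective search (opencv-contrib-python). The top-O selection and the no-overlap test against base-class ground truths follow the text; the IoU helper, the strict zero-overlap threshold, and the function names are our own simplifications.

```python
import cv2
import numpy as np

def iou(box_a: np.ndarray, box_b: np.ndarray) -> float:
    """IoU of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def pseudo_proposals(image_bgr: np.ndarray, gt_boxes: np.ndarray, top_o: int = 30) -> np.ndarray:
    """Return up to top_o ranked selective-search boxes that do not
    overlap any base-class ground truth box (the pseudo labels b')."""
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(image_bgr)
    ss.switchToSelectiveSearchFast()
    ranked = ss.process()  # proposals as (x, y, w, h), ranked by the algorithm
    keep = []
    for (x, y, w, h) in ranked:
        box = np.array([x, y, x + w, y + h], dtype=np.float32)
        if all(iou(box, gt) == 0.0 for gt in gt_boxes):  # no overlap with GT
            keep.append(box)
        if len(keep) == top_o:
            break
    return np.stack(keep) if keep else np.empty((0, 4), dtype=np.float32)
```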
Figure 2: Overview of our proposed incremental few-shot fine-tuning stage. Parameters of the modules shaded in green are frozen during training. Refer to the text for more details.

Incremental Few-Shot Fine-Tuning

As shown in Figure 2, we first initialize the parameters of the novel model from the fine-tuned base model of the first stage. We then propose to fine-tune the class-specific projection layer and classification head with a few samples of the novel classes while keeping the class-agnostic components frozen. However, the constant updating of the projection layer and classification head during novel class learning can aggravate catastrophic forgetting of the base classes. Therefore, we propose to use knowledge distillation to mitigate catastrophic forgetting. Specifically, the base model is utilized to prevent the projection layer output features of the novel model from deviating too much from those of the base model. However, direct knowledge distillation on the full features causes conflicts and thus hurts the performance on the novel classes. We thus use the ground truth bounding boxes of the novel classes to build a binary mask mask^novel (1 within ground truth boxes, 0 otherwise) that prevents negative influence on the novel class learning from the features of the base model. The distillation loss with the mask on the features is written as:

\mathcal{L}_{\text{feat}}^{\text{kd}} = \frac{1}{2 N^{\text{novel}}} \sum_{i=1}^{w} \sum_{j=1}^{h} \sum_{k=1}^{c} (1 - \text{mask}^{\text{novel}}_{ij}) \left\| f^{\text{novel}}_{ijk} - f^{\text{base}}_{ijk} \right\|^2,  (6)

where N^novel = Σ_{i=1}^{w} Σ_{j=1}^{h} (1 − mask^novel_{ij}). f^base and f^novel denote the features of the base and novel models, respectively, and w, h and c are the width, height and channels of the feature.

For the knowledge distillation on the classification head of DETR, we first select prediction outputs from the M prediction outputs of the base model as pseudo ground truths of the base classes. Specifically, for an input novel image, we consider a prediction output of the base model as a pseudo ground truth of the base classes when its class probability is greater than 0.5 and its bounding box has no overlap with the ground truth bounding boxes of the novel classes. We then adopt a pair-wise matching cost to find the bipartite matching between the pseudo ground truths and the predictions of the novel model. Subsequently, the classification outputs of the base and novel models are compared in the distillation loss function given by:

\mathcal{L}_{\text{cls}}^{\text{kd}} = \mathcal{L}_{\text{kl\_div}}(\log(q^{\text{novel}}), q^{\text{base}}),  (7)

where we follow (Hinton, Vinyals, and Dean 2015) in the definition of the KL-divergence loss L_kl_div between the class probabilities of the novel and base models, and q denotes the class probability.

The Hungarian loss L_hg is also applied to the ground truth set y and the predictions ŷ on the novel data D_novel. The overall loss L_total^novel to train the novel model on the novel data D_novel is given by:

\mathcal{L}_{\text{total}}^{\text{novel}} = \mathcal{L}_{\text{hg}}(y, \hat{y}) + \lambda_{\text{feat}} \mathcal{L}_{\text{feat}}^{\text{kd}} + \lambda_{\text{cls}} \mathcal{L}_{\text{cls}}^{\text{kd}},  (8)

where λ_feat and λ_cls are hyperparameters that balance the loss terms.

Experiments

Experimental Setup

Incremental object detection. We first evaluate the performance of our proposed method on the incremental setting studied in (Shmelkov, Schmid, and Alahari 2017; Kj et al. 2021). We conduct the evaluation on the popular object detection benchmark MS COCO 2017 (Lin et al. 2014), which covers 80 object classes. The train set of COCO serves as training data and the val set serves as testing data. Standard evaluation metrics for COCO are adopted. Following (Shmelkov, Schmid, and Alahari 2017; Kj et al. 2021), we report results on the incremental learning setting of adding a group of novel classes (40+40). Furthermore, results on the setting of adding one novel class (40+1) are also reported to demonstrate the effectiveness of our method.

Incremental few-shot object detection. We follow the data setups of prior works on incremental few-shot object detection, e.g. (Perez-Rua et al. 2020). Specifically, we conduct the experimental evaluations on two widely used object detection benchmarks: MS COCO 2017 and PASCAL VOC 2007 (Everingham et al. 2010). The train set of COCO serves as training data and the val set serves as testing data. The trainval set of VOC serves as training data and the test set serves as testing data. Standard evaluation metrics for COCO are adopted. COCO contains objects from 80 different classes, including 20 classes that intersect with VOC. We adopt the 20 shared classes as novel classes and the remaining 60 classes as base classes. Following (Perez-Rua et al. 2020), there are two dataset splits: same-dataset on COCO and cross-dataset with the 60 COCO classes as base classes. For the same-dataset evaluation on COCO, the model is pre-trained and fine-tuned on the COCO base training data. Novel class learning is then conducted on the COCO novel training data and evaluated on the COCO testing data. We use the same setup for the cross-dataset evaluation from COCO to VOC, where the model is pre-trained and fine-tuned on the COCO base training data, but the incremental few-shot fine-tuning stage is conducted on the VOC training data and evaluated on the VOC testing data. We evaluate incremental few-shot object detection with 1, 5 and 10 shots per novel class, and the base training data is no longer accessible during novel class learning, following the definition of class-incremental learning.

Implementation Details

We use ResNet-50 as the feature extractor. The network architectures and hyperparameters of the transformer encoder and decoder remain the same as Deformable DETR (Zhu et al. 2020). λ′, λ_feat and λ_cls are set to 1, 0.1 and 2, respectively. We report the results of our proposed method with single-scale features. Training is carried out on 8 RTX 6000 GPUs with a batch size of 2 per GPU. In the base model pre-training stage, we train our model using the AdamW optimizer with an initial learning rate of 2 × 10^−4 and a weight decay of 1 × 10^−4. We train our model for 50 epochs and decay the learning rate at the 40th epoch by a factor of 0.1. In the base model fine-tuning stage, the model is initialized from the pre-trained base model. The parameters of the projection layer and classification head are then fine-tuned while keeping the other parameters frozen.
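A small sketch of this freezing scheme in PyTorch, assuming the attribute names of the public Deformable DETR implementation (input_proj for the projection layer, class_embed for the classification head); adapt the names to the actual model.

```python
import torch.nn as nn

def freeze_class_agnostic(model: nn.Module) -> None:
    """Freeze everything, then re-enable gradients only for the
    class-specific projection layer and classification head."""
    for param in model.parameters():
        param.requires_grad = False
    for module in (model.input_proj, model.class_embed):  # assumed names
        for param in module.parameters():
            param.requires_grad = True
```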
Shot   Method                                 Base AP / AP50   Novel AP / AP50   All AP / AP50
Base   iMTFA (Ganea, Boom, and Poppe 2021)    38.2 / 58.0      -    / -          -    / -
Base   (Zhu et al. 2020)                      36.9 / 56.7      -    / -          -    / -
1      ONCE (Perez-Rua et al. 2020)           17.9 / -         0.7  / -          13.6 / -
1      iMTFA (Ganea, Boom, and Poppe 2021)    27.8 / 40.1      3.2  / 5.9        21.7 / 31.6
1      Ours                                   29.4 / 47.1      1.9  / 2.7        22.5 / 36.0
5      ONCE (Perez-Rua et al. 2020)           17.9 / -         1.0  / -          13.7 / -
5      iMTFA (Ganea, Boom, and Poppe 2021)    24.1 / 33.7      6.1  / 11.2       19.6 / 28.1
5      Ours                                   30.5 / 48.4      8.3  / 13.3       24.9 / 39.6
10     ONCE (Perez-Rua et al. 2020)           17.9 / -         1.2  / -          13.7 / -
10     iMTFA (Ganea, Boom, and Poppe 2021)    23.4 / 32.4      7.0  / 12.7       19.3 / 27.5
10     Ours                                   27.3 / 44.0      14.4 / 22.4       24.1 / 38.6

Table 1: Results of incremental few-shot object detection on COCO val set.
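For reference, the two knowledge-distillation terms of Eqs. (6) and (7) used in this stage can be sketched in PyTorch as follows. The feature term assumes single-image [C, H, W] projection-layer features and a precomputed binary mask; the classification term forms probabilities with a softmax for illustration, although Deformable DETR's classification head is actually trained with sigmoid focal loss, so treat the activation choice and all names as our assumptions.

```python
import torch
import torch.nn.functional as F

def masked_feature_distillation(f_novel: torch.Tensor,
                                f_base: torch.Tensor,
                                mask_novel: torch.Tensor) -> torch.Tensor:
    """Eq. (6): distill base-model projection features only outside
    novel-class ground truth boxes. f_novel, f_base: [C, H, W];
    mask_novel: [H, W], 1 inside novel boxes and 0 elsewhere."""
    keep = 1.0 - mask_novel                     # background region only
    n_novel = keep.sum().clamp(min=1.0)         # N^novel, guarded against 0
    sq_diff = (f_novel - f_base).pow(2) * keep  # broadcast over channel dim
    return sq_diff.sum() / (2.0 * n_novel)

def classification_distillation(novel_logits: torch.Tensor,
                                base_logits: torch.Tensor) -> torch.Tensor:
    """Eq. (7): KL divergence between class probabilities of matched
    novel-model predictions and base-model pseudo ground truths."""
    log_q_novel = F.log_softmax(novel_logits, dim=-1)  # log(q^novel)
    q_base = F.softmax(base_logits, dim=-1)            # q^base
    return F.kl_div(log_q_novel, q_base, reduction="batchmean")
```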

We fine-tune the model for 1 epoch with a learning rate of 2 × 10^−4. We apply the same settings as the base model pre-training stage in the incremental few-shot fine-tuning stage.

Incremental Object Detection

Addition of one class. Table 2 shows the results on the incremental learning setting of one novel class addition. We take the first 40 classes from MS COCO as the base classes, and the 41st class is used as the novel class. We report the average precision (AP) and the average precision at an IoU threshold of 0.5 (AP50) for the base, novel and all classes, respectively. "1-41" and "1-40" are baselines trained without using incremental learning. "41 (fine-tune)" and "41 (scratch)" are the model trained with and without initialization from the pre-trained base model, respectively. The results of "41 (scratch)" are significantly lower than "41 (fine-tune)" due to the scarcity of the novel data. Our approach achieves AP and AP50 of 41.9% | 64.1%, 29.2% | 50.3% and 41.6% | 63.8% for the base, novel and all classes, respectively. Our results are almost on par with the results from baseline training on the novel class "41" with fine-tuning (Ours: 29.2% | 50.3% vs. "41 (fine-tune)": 30.7% | 50.3%). Furthermore, our method also achieves results that are close to the baseline training on the base classes "1-40" (Ours: 41.9% | 64.1% vs. "1-40": 44.8% | 67.8%). These results support the effectiveness of our incremental learning framework in preserving base class knowledge while learning novel class knowledge.

Class            Method              Base AP / AP50   Novel AP / AP50   All AP / AP50
1-41             (Zhu et al. 2020)   45.6 / 68.6      31.3 / 55.7       45.3 / 68.3
1-40             (Zhu et al. 2020)   44.8 / 67.8      -    / -          -    / -
41 (fine-tune)   (Zhu et al. 2020)   15.6 / 23.9      30.7 / 50.3       16.0 / 24.6
41 (scratch)     (Zhu et al. 2020)   -    / -         16.7 / 32.4       -    / -
(1-40) + (41)    Ours                41.9 / 64.1      29.2 / 50.3       41.6 / 63.8

Table 2: Results of "40+1" on COCO val set. "1-40" is the base classes, and "41" is the novel class.

Addition of a group of classes. Table 3 shows the results with the first 40 classes from MS COCO as the base classes and the next 40 classes as the additional group of novel classes. "1-80" and "1-40" are baselines trained without using incremental learning. "41-80 (fine-tune)" and "41-80 (scratch)" are the model trained with and without initialization from the pre-trained base model, respectively. The performance of "41-80 (scratch)" is better than "41-80 (fine-tune)" due to the abundance of novel class data. Our method achieves AP and AP50 that are almost on par with the baseline trained on the base classes "1-40" without using incremental learning (Ours: 44.0% | 66.1% vs. "1-40": 44.8% | 67.8%). Additionally, our method significantly outperforms (Kj et al. 2021) on all classes (Ours: 37.3% | 56.6% vs. (Kj et al. 2021): 23.8% | 40.5%). Our method also outperforms (Shmelkov, Schmid, and Alahari 2017) under the incremental setting (Ours: 37.3% | 56.6% vs. (Shmelkov, Schmid, and Alahari 2017): 21.3% | 37.4%). These results show the effectiveness of our approach when a group of novel classes is added.

Class               Method                                 Base AP / AP50   Novel AP / AP50   All AP / AP50
1-80                (Zhu et al. 2020)                      46.8 / 69.4      36.3 / 54.7       41.4 / 61.8
1-40                (Zhu et al. 2020)                      44.8 / 67.8      -    / -          -    / -
41-80 (fine-tune)   (Zhu et al. 2020)                      0    / 0         33.0 / 49.6       16.5 / 24.8
41-80 (scratch)     (Zhu et al. 2020)                      -    / -         35.0 / 52.6       -    / -
(1-40) + (41-80)    (Shmelkov, Schmid, and Alahari 2017)   -    / -         -    / -          21.3 / 37.4
(1-40) + (41-80)    (Kj et al. 2021)                       -    / -         -    / -          23.8 / 40.5
(1-40) + (41-80)    Ours                                   44.0 / 66.1      30.6 / 47.1       37.3 / 56.6

Table 3: Results of "40+40" on COCO val set. "1-40" is the base classes, and "41-80" is the novel classes.

Incremental Few-Shot Object Detection

Same-dataset. In this experiment, we show the effectiveness of our method in learning to detect novel classes without forgetting the base classes, using only a few novel samples. The first two rows of Table 1 show the results of iMTFA (Ganea, Boom, and Poppe 2021) and Deformable DETR (Zhu et al. 2020) trained on the base classes without using incremental few-shot learning. Our method achieves the best performance on the AP and AP50 for the base and all classes. Our method significantly outperforms iMTFA and ONCE (Perez-Rua et al. 2020) on the base, novel and all classes, except for the 1-shot setting of the novel classes, where all methods perform poorly due to the extreme data scarcity.
Row ID   Two-stage fine-tuning   SS: L_box   SS: L_cls   KD: L_feat^kd   KD: L_cls^kd   Base AP / AP50   Novel AP / AP50   All AP / AP50
1                                                                                       0.1  / 0.2       1.4  / 2.5        0.4  / 0.8
2        X                                                                              19.7 / 32.5      5.2  / 8.2        16.1 / 26.4
3        X                       X           X                                          16.3 / 27.1      8.0  / 12.9       14.2 / 23.5
4        X                       X           X           X                              23.8 / 38.0      8.3  / 13.2       19.9 / 31.8
5        X                       X           X                           X              26.1 / 43.0      8.0  / 12.7       21.6 / 35.4
6        X                                               X               X              30.7 / 49.0      5.1  / 8.5        24.3 / 38.9
7        X                       X                       X               X              30.3 / 48.2      7.5  / 12.2       24.6 / 39.2
8        X                       X           X           X               X              30.5 / 48.4      8.3  / 13.3       24.9 / 39.6

Table 4: Ablation studies with 5-shot per novel class on COCO val set. SS denotes the self-supervised learning losses and KD the knowledge distillation losses.

Row ID   Base model pre-training   Incremental few-shot fine-tuning   Base AP / AP50   Novel AP / AP50   All AP / AP50
1        One-step                  Frozen                             28.3 / 45.2      7.1 / 11.3        23.0 / 36.7
2        Two-step                  Unfrozen                           4.3  / 9.8       3.5 / 6.2         4.1  / 8.9
3        Two-step                  Frozen                             30.5 / 48.4      8.3 / 13.3        24.9 / 39.6

Table 5: Ablation studies with 5-shot per novel class on COCO val set.

Particularly, our method significantly outperforms iMTFA on the AP and AP50 for all classes, i.e. iMTFA: (21.7% | 31.6%, 19.6% | 28.1%, 19.3% | 27.5%) vs. Ours: (22.5% | 36.0%, 24.9% | 39.6%, 24.1% | 38.6%) under the 1, 5 and 10-shot settings, respectively. These results show the superiority of our method for incremental few-shot object detection.

Cross-dataset. We report the performance of incremental few-shot object detection in a cross-dataset setting from COCO to VOC to demonstrate the domain adaptation ability of our method, where the domain of the novel model training dataset differs from that of the base model training dataset. We only report the performance on the novel classes since there are no base classes in VOC. As shown in Table 6, our method achieves a much better result of 16.6% AP compared to ONCE (Perez-Rua et al. 2020) with 2.4% AP in the 5-shot setting. Furthermore, our method also significantly outperforms ONCE (Ours: 24.6% vs. ONCE: 2.6%) in the 10-shot setting.

Shot   Method                         Novel AP / AP50
1      ONCE (Perez-Rua et al. 2020)   -    / -
1      Ours                           4.1  / 6.6
5      ONCE (Perez-Rua et al. 2020)   2.4  / -
5      Ours                           16.6 / 26.3
10     ONCE (Perez-Rua et al. 2020)   2.6  / -
10     Ours                           24.6 / 38.4

Table 6: Results of incremental few-shot object detection on VOC test set.

Ablation Studies

Table 4 shows ablation studies to understand the effectiveness of each component in our framework. The first row is the result of the model initialized from the pre-trained base model and then trained with the standard Deformable DETR loss on a few novel samples, without using any component introduced in our framework. This setting gives the lowest performance, which evidences the importance of the components of our method. The subsequent settings in Rows 2-8 include various combinations of the components of our framework. Our complete framework achieves the best performance on all classes. These results further show the effectiveness of the two-stage fine-tuning, self-supervised learning and knowledge distillation strategies in our method. We also combine the two steps of the base model pre-training stage into a single step and train the model from scratch with the pseudo targets generated by the selective search algorithm together with the targets from the abundant base class data. As shown in Row 1 of Table 5, the results on the base and novel classes drop with one-step base model pre-training (Ours: 30.5% vs. One-step: 28.3% AP for base classes; Ours: 8.3% vs. One-step: 7.1% AP for novel classes). Furthermore, we report the effectiveness of keeping the class-agnostic modules of the model frozen in the incremental few-shot fine-tuning stage. As shown in Row 2 of Table 5, the results on the base and novel classes drop significantly when unfreezing the class-agnostic CNN backbone, transformer and regression head in the incremental few-shot learning stage. These results demonstrate that each component of our method plays a critical role in incremental few-shot learning.

Conclusion

In this paper, we present a novel incremental few-shot object detection framework, Incremental-DETR, for the more challenging and realistic scenario with no samples from the base classes and only a few samples from the novel classes in the training dataset. To circumvent the challenges DETR faces in incremental few-shot object detection, we identify with empirical experimental studies that the projection layer and the classification head of the DETR architecture are class-specific, while other modules such as the transformer are class-agnostic. We then propose the use of a two-stage fine-tuning strategy and self-supervised learning to retain the knowledge of the base classes and to learn a more generalizable projection layer that adapts to the novel classes. Furthermore, we utilize a knowledge distillation strategy to transfer knowledge from the base to the novel model using the few samples from the novel classes. Extensive experimental results on the benchmark datasets show the effectiveness of our proposed method. We hope our work can provide good insights and inspire further research into the relatively under-explored but important problem of incremental few-shot object detection.

Acknowledgements

The first author is funded by a scholarship from the China Scholarship Council (CSC). This research/project is supported by the National Research Foundation Singapore and DSO National Laboratories under the AI Singapore Programme (AISG Award No: AISG2-RP-2020-016) and the Tier 2 grant MOE-T2EP20120-0011 from the Singapore Ministry of Education.

References

Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; and Zagoruyko, S. 2020. End-to-end object detection with transformers. In European Conference on Computer Vision, 213–229. Springer.
Carlucci, F. M.; D'Innocente, A.; Bucci, S.; Caputo, B.; and Tommasi, T. 2019. Domain generalization by solving jigsaw puzzles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2229–2238.
Caron, M.; Bojanowski, P.; Joulin, A.; and Douze, M. 2018. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), 132–149.
Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; and Joulin, A. 2020. Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882.
Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, 1597–1607. PMLR.
Doersch, C.; Gupta, A.; and Efros, A. A. 2015. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, 1422–1430.
Everingham, M.; Van Gool, L.; Williams, C. K.; Winn, J.; and Zisserman, A. 2010. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2): 303–338.
Ganea, D. A.; Boom, B.; and Poppe, R. 2021. Incremental Few-Shot Instance Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1185–1194.
Gidaris, S.; Bursuc, A.; Komodakis, N.; Pérez, P.; and Cord, M. 2019. Boosting few-shot visual learning with self-supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8059–8068.
Gidaris, S.; Singh, P.; and Komodakis, N. 2018. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728.
Girshick, R. 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, 1440–1448.
Girshick, R.; Donahue, J.; Darrell, T.; and Malik, J. 2014a. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 580–587.
Girshick, R.; Donahue, J.; Darrell, T.; and Malik, J. 2014b. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 580–587.
He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729–9738.
Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Kang, B.; Liu, Z.; Wang, X.; Yu, F.; Feng, J.; and Darrell, T. 2019. Few-shot object detection via feature reweighting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8420–8429.
Kj, J.; Rajasegaran, J.; Khan, S.; Khan, F. S.; and Balasubramanian, V. N. 2021. Incremental Object Detection via Meta-Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Kuhn, H. W. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2): 83–97.
Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; and Belongie, S. 2017a. Feature Pyramid Networks for Object Detection. In IEEE Conference on Computer Vision and Pattern Recognition.
Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017b. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, 2980–2988.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 740–755. Springer.
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; and Berg, A. C. 2016. SSD: Single Shot MultiBox Detector. In European Conference on Computer Vision.
McCloskey, M.; and Cohen, N. J. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, 109–165. Elsevier.
Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; and Efros, A. A. 2016. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2536–2544.
Perez-Rua, J.-M.; Zhu, X.; Hospedales, T. M.; and Xiang, T. 2020. Incremental Few-Shot Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13846–13855.
Ratcliff, R. 1990. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological Review, 97(2): 285.
Redmon, J.; Divvala, S.; Girshick, R.; and Farhadi, A. 2016.
You only look once: Unified, real-time object detection. In
Proceedings of the IEEE conference on computer vision and
pattern recognition, 779–788.
Ren, S.; He, K.; Girshick, R.; and Jian, S. 2015. Faster
R-CNN: Towards Real-Time Object Detection with Region
Proposal Networks. In International Conference on Neural
Information Processing Systems.
Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.;
and Savarese, S. 2019. Generalized intersection over union: A
metric and a loss for bounding box regression. In Proceedings
of the IEEE/CVF conference on computer vision and pattern
recognition, 658–666.
Shmelkov, K.; Schmid, C.; and Alahari, K. 2017. Incremental
learning of object detectors without catastrophic forgetting.
In Proceedings of the IEEE International Conference on
Computer Vision, 3400–3409.
Sun, B.; Li, B.; Cai, S.; Yuan, Y.; and Zhang, C. 2021. FSCE:
Few-shot object detection via contrastive proposal encoding.
In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 7352–7362.
Uijlings, J. R.; Van De Sande, K. E.; Gevers, T.; and Smeul-
ders, A. W. 2013. Selective search for object recognition.
International journal of computer vision, 104(2): 154–171.
Wang, X.; Huang, T. E.; Darrell, T.; Gonzalez, J. E.; and
Yu, F. 2020. Frustratingly simple few-shot object detection.
arXiv preprint arXiv:2003.06957.
Wang, Y.-X.; Ramanan, D.; and Hebert, M. 2019. Meta-
learning to detect rare objects. In Proceedings of the
IEEE/CVF International Conference on Computer Vision,
9925–9934.
Wu, J.; Liu, S.; Huang, D.; and Wang, Y. 2020. Multi-
scale positive sample refinement for few-shot object detec-
tion. In European Conference on Computer Vision, 456–472.
Springer.
Wu, X.; Sahoo, D.; and Hoi, S. 2020. Meta-RCNN: Meta
learning for few-shot object detection. In Proceedings of the
28th ACM International Conference on Multimedia, 1679–
1687.
Zhai, X.; Oliver, A.; Kolesnikov, A.; and Beyer, L. 2019. S4l:
Self-supervised semi-supervised learning. In Proceedings of
the IEEE/CVF International Conference on Computer Vision,
1476–1485.
Zhang, G.; Luo, Z.; Cui, K.; and Lu, S. 2021. Meta-detr: Few-
shot object detection via unified image-level meta-learning.
arXiv preprint arXiv:2103.11731, 2.
Zhou, X.; Wang, D.; and Krähenbühl, P. 2019. Objects as
points. arXiv preprint arXiv:1904.07850.
Zhu, F.; Zhang, X.-Y.; Wang, C.; Yin, F.; and Liu, C.-L.
2021. Prototype Augmentation and Self-Supervision for
Incremental Learning. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
5871–5880.
Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; and Dai, J. 2020.
Deformable detr: Deformable transformers for end-to-end
object detection. arXiv preprint arXiv:2010.04159.
