https://doi.org/10.1007/s11042-021-11137-y
Wei Liang1 · Pengfei Xu1 · Ling Guo1 · Heng Bai1 · Yang Zhou1 · Feng Chen1
Abstract
Due to the rapid development of science and technology, object detection has become a promising research direction in computer vision. Most object detection frameworks proposed in recent years are 2D. However, 2D object detection cannot take three-dimensional space into account, so it cannot be used to solve problems in the real world. Hence, we conduct this survey of 3D object detection in the hope that 3D object detection methods can be better applied in intelligent video surveillance, robot navigation and autonomous driving. Various 3D object detection methods exist, but in this paper we focus only on the popular deep learning-based methods. We divide these approaches into four categories according to the category of input data. In addition, we discuss the innovations of these frameworks and compare their experimental results in terms of accuracy. Finally, we indicate the technical difficulties associated with current 3D object detection and discuss future research directions.
1 Introduction
In recent years, both object detection and recognition frameworks have developed rapidly, and their applications have become increasingly extensive. It is therefore interesting to determine the areas in which an object detection algorithm can be used. Areas such as automatic car driving systems, robot design, and the detection and analysis of abnormal events in videos are all inseparable from object detection and recognition frameworks. With the rapid development of deep learning methods, object detection frameworks have gradually shifted from traditional image processing-based methods to those based on deep neural networks.
Traditional object detection and recognition methods mainly rely on extracting features with traditional image processing techniques, such as SIFT [27, 68], HOG [57] and SURF [2]. Since the features in these methods are manually designed and do not adapt to varying parameters and configurations, deep learning-based 2D frameworks have attracted the attention of many researchers.
The 2D frameworks can be divided into two-stage and one-stage frameworks. In two-stage object detection frameworks, the detection procedure is mainly completed by a convolutional neural network that extracts CNN convolution features. Training the network involves two main steps: the first is to train the RPN (region proposal network) [5], while the second is to train the network to detect the object area. In one-stage object detection networks [7, 21, 24, 33, 45, 59], the category and location information is given directly through the backbone network without an RPN. Although these approaches are faster, their accuracy is lower than that of two-stage networks. Typical one-stage object detection networks include YOLOv1 [46], YOLOv2 [43], YOLOv3 [44], SSD [34], DSSD [18], RetinaNet [33], etc.
The above methods are all deep learning methods for image-based 2D object detection
and recognition. Such deep learning-based 2D object detection frameworks have greatly
improved in terms of their accuracy and real-time capabilities compared to the traditional
feature-based machine learning methods. They have achieved remarkable results when
tested on KITTI [19], COCO [3] and other public datasets.
However, because two-dimensional object detection only returns the pixel coordinates of the object and consequently lacks depth, size and other physical-world parameters, these approaches have certain limitations in practical applications. Especially in the areas of autonomous driving and robotics, it is often necessary to combine sensors such as lidar and millimeter-wave radar to achieve multi-modal fusion frameworks in order to improve the reliability of the perception system.
So, how can we obtain the three-dimensional spatial information of an object? Let us consider autonomous driving as an example: modern autonomous vehicles are equipped with lidar and cameras. The advantage of lidar is its ability to obtain three-dimensional information of objects in 3D space directly; moreover, the depth information is highly accurate, although the cost is relatively high. The advantage of the camera is that it preserves more detailed semantic information, although the correspondence between image points and three-dimensional points must be computed correctly, a task that requires a relatively sophisticated framework. When the automatic driving system detects and identifies three-dimensional scene information through images, lidar and other signals, this helps the computer determine the positional relationship between itself and the surrounding objects and make correct driving judgments and decisions.
Accordingly, object detection in three-dimensional space forms the basis for the operation of automatic driving systems. Improving the object detection efficiency in three-dimensional scenes as much as possible while reducing the detection cost has become an urgent problem to be solved. In addition, in fields related to autonomous driving technology, such as indoor robot navigation and augmented reality, 2D object detection cannot complete the task well. Therefore, 3D object detection will be a more significant development direction in computer vision. As deep learning has developed, 3D object detection and recognition approaches based on deep learning have become more mature. In this paper, we first divide the 3D object detection methods into four types depending on the type of input data and briefly discuss the advantages and disadvantages of the different methods. Then, the framework of each method type is discussed in detail. Finally, we discuss
developing trends in 3D object detection and its future. The main contributions of this paper are as follows:
– We survey many recent 3D object detection methods and classify them reasonably.
– We summarize the advantages and disadvantages of many 3D detection methods.
– We compare and analyze the performance of recent 3D detection methods.
– We discuss future technical trends of 3D detection methods.
The remainder of this paper is structured as follows. Section 2 briefly reviews the exist-
ing 3D object detection methods. Section 3 introduces various methods in detail. Section 4
compares the performance of each method and describes the technical difficulties associ-
ated with current 3D object detection and discusses future research directions. Section 5
summarizes the full text and predicts the future trend of 3D object detection.
Fig. 1 The overview of methods: 3D object detection methods can be applied to autonomous vehicles, robot navigation and so on. Depending on the type of input data, they are divided into four categories: methods based on the original point cloud; methods based on monocular images; methods based on multi-view images; and methods based on the fusion of point cloud and image. For each category, we show simple frameworks of typical methods at the right of the figure
What is a 3D point cloud, and how is it used for object detection? Point cloud data is a set of vectors in a 3D coordinate system. These vectors are usually expressed in the form of three-dimensional coordinates (X, Y, Z) and are mainly used to represent the outer surface shape of an object. However, in addition to the geometric position information represented by (X, Y, Z), a point cloud can also carry the RGB color, depth, grey value and segmentation result of a point.
3D point cloud data is mainly obtained in three ways: (1) direct generation by 3D scanning equipment; (2) 3D reconstruction from 2D images; (3) reverse generation of the point cloud from a 3D model.
Since 3D point cloud data is represented by a collection of unordered points, it is indispensable to preprocess the point cloud data before using a deep learning model to handle it.
There are two main preprocessing methods. The first involves projecting the point cloud
data onto a two-dimensional space, such as a forward or bird’s eye view; this enables inte-
gration of the image information from the camera to realise the combination of data from
different perspectives, after which the 3D object can be detected [65].
The other method involves dividing the point cloud data into spatial voxels. We can divide the three-dimensional space in this way, introduce spatial dependence into the point cloud data, and then use 3D convolution for data processing. However, the accuracy of this method depends on the fineness of the three-dimensional space segmentation, while the computational complexity of 3D convolution is also higher. The above two methods are in common use.
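As a rough illustration of the second preprocessing route, the sketch below groups the points of an (N, 3+) array into a regular voxel grid with numpy. It is a minimal, assumption-laden example rather than code from any of the surveyed papers; the grid resolution `voxel_size`, the crop box `pc_range` and the per-voxel cap `max_points` are hypothetical settings.

```python
import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.4),
             pc_range=(0, -40, -3, 70.4, 40, 1), max_points=35):
    """Group an (N, 3+) point cloud into voxels on a regular grid.

    points:     (N, 3+) array, columns x, y, z, [extra features]
    voxel_size: edge length of a voxel along x, y, z (metres)
    pc_range:   (x_min, y_min, z_min, x_max, y_max, z_max) crop box
    max_points: keep at most this many (randomly sampled) points per voxel
    """
    lo = np.array(pc_range[:3])
    hi = np.array(pc_range[3:])
    size = np.array(voxel_size)

    # Drop points outside the crop box.
    mask = np.all((points[:, :3] >= lo) & (points[:, :3] < hi), axis=1)
    pts = points[mask]

    # Integer voxel index of every remaining point.
    idx = np.floor((pts[:, :3] - lo) / size).astype(np.int64)

    voxels = {}
    for p, key in zip(pts, map(tuple, idx)):
        voxels.setdefault(key, []).append(p)

    # Random sampling so that dense voxels do not dominate.
    coords, feats = [], []
    for key, plist in voxels.items():
        plist = np.stack(plist)
        if len(plist) > max_points:
            plist = plist[np.random.choice(len(plist), max_points, replace=False)]
        coords.append(key)
        feats.append(plist)
    return np.array(coords), feats
```

The returned voxel coordinates and per-voxel point sets could then be fed to a voxel feature encoder of the kind discussed later in this section.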
Table 1 Advantages and disadvantages of the four categories of methods

Point cloud. Advantages: richer spatial structure information; higher accuracy than image-based 3D object detection. Disadvantages: the cost of acquiring data is high; the original point cloud cannot provide texture information of the object; computationally expensive.

Fusion of point cloud and image. Advantages: utilizes the point cloud and the image at the same time to obtain more accurate output. Disadvantages: computationally expensive.

Multi-view image. Advantages: fuses image features from different angles; higher accuracy than monocular-based methods. Disadvantages: needs to calculate depth information.

Monocular image. Advantages: the data is easy to acquire; most methods are improved on the basis of 2D object detection methods. Disadvantages: lack of accurate depth information; lack of spatial structure features; needs prior information; needs to calculate depth information.
Although using point clouds for 3D object detection can yield better detection results, this approach fundamentally requires a high data cost, and the point cloud cannot provide the texture information of the object. Since monocular images can be acquired with a camera, their acquisition cost is low. In addition, many image-based 2D object detection methods are relatively mature, and 3D object detection methods based on monocular images can adopt their ideas. However, as a monocular image cannot recover the 3D position and size of an object, the result will always be unsatisfactory if only a monocular image is used to detect objects. Nevertheless, for a specific object, if its unique prior information can be obtained, the object's size and position in the real world can be computed with deep learning combined with projection knowledge.
3D object detection is a significant task in the autonomous driving field. The input of 3D object detection is largely obtained via accurate but expensive lidar technology. Although the accuracy is high, the cost is also high. Therefore, if it were possible to not only reduce the cost but also improve the detection accuracy, this would be a huge improvement. The multi-view-based method may be able to solve this problem. A suitable approach is to use multiple cameras or moving cameras to form a stereo vision system, or to obtain the 3D point cloud of the object with a depth camera or lidar.
The most important application of 3D object detection is automatic driving. Most autonomous vehicles are equipped with lidar and cameras. In order to improve detection performance, it is particularly important to fuse lidar and image information. The image represents the projection of the real world onto the camera, while the lidar captures the world's native 3D structure.
The advantages and disadvantages of point cloud-based methods (PCM) have been introduced in Table 1. It is clear that using a point cloud for object detection requires preprocessing, and that preprocessing usually loses some information from the original data. Therefore, some researchers have proposed using the point cloud directly as the detection input.
PointNet is a typical method that takes the original point cloud as input. In [40], Qi et al. show how to train PointNet to complete the tasks of 3D shape classification, shape segmentation and semantic analysis (see Fig. 2). The authors also visualize the 3D features computed by the network and provide an intuitive explanation of its performance.
The framework of PointNet is composed of two parts: a classification network and a segmentation network. The classification network takes n points as input, applies input and feature transformations, and then aggregates the point features through max pooling; the output of the framework is the classification score for each of the k categories.
Fig. 2 Overview of PointNet: the classification network takes n points as input and outputs the classification scores for k classes; the segmentation network is an extension of the classification network that combines global and local features to output a score for each point
The segmentation network is an extension of the classification network, which connects the global and local features and outputs a score for each point. The innovations of this work are as follows: (1) it transforms the input point cloud by learning a spatial transformation network (T-Net), which can be intuitively understood as rotating the point cloud to an angle more suitable for classification or segmentation by continuously adjusting the final loss function; the transformation network applied at the feature level can further be understood as transforming the point cloud in feature space. (2) The global features are extracted by max pooling, which resolves the disorder of the point cloud. Although PointNet can complete the detection task, it cannot capture well the local structure induced by the metric space, which limits its ability to recognize fine-grained categories and to generalize to complex scenes.
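The core idea of a shared per-point MLP followed by a symmetric max-pooling function can be sketched in a few lines of PyTorch. This is a hedged, simplified illustration of the principle rather than the published PointNet architecture: the layer widths are arbitrary and the T-Net input/feature transforms are deliberately omitted.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Shared per-point MLP + max pooling: the order of the points does not matter."""
    def __init__(self, num_classes=40):
        super().__init__()
        # 1x1 convolutions act as an MLP applied independently to every point.
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, xyz):                                 # xyz: (B, N, 3)
        feats = self.point_mlp(xyz.transpose(1, 2))         # (B, 1024, N)
        global_feat = feats.max(dim=2).values               # symmetric function
        return self.classifier(global_feat)                 # (B, num_classes)

# Permuting the points leaves the prediction unchanged.
pts = torch.randn(2, 1024, 3)
net = TinyPointNet()
perm = torch.randperm(1024)
assert torch.allclose(net(pts), net(pts[:, perm, :]), atol=1e-5)
```

The final assertion demonstrates the permutation invariance that the max-pooling step provides, which is exactly the property used to handle the disorder of raw point clouds.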
In 2017, in [41], Qi et al. introduced a hierarchical neural network based on [40], which recursively applies PointNet on nested partitions of the input point set. By using metric space distances, PointNet++ can learn local features at increasing contextual scales. This new point cloud learning method can combine features of different scales. The experiments further demonstrate that PointNet++ can learn deep point set features more efficiently and robustly. Although both PointNet and PointNet++ are 3D object segmentation and classification methods, they can also be used in 3D object detection, and their framework has become the basic way to extract point cloud features without preprocessing.
Similar to the two-stage object detection process described for images, PCM can also be two-stage. In 2019, Shi et al. proposed PointRCNN, a two-stage method [49]: in the first stage, a small number of high-quality 3D proposals are generated by segmenting the point cloud into foreground and background points, which avoids the need for a large number of predefined 3D anchors in 3D space and significantly limits the search space of the generated 3D proposals; in the second stage, the box coordinates are refined in robust canonical coordinates to obtain the final detection results.
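The canonical transformation used in the second stage can be sketched as follows. This is our own hedged illustration of the idea, not code from [49]: points belonging to a proposal are translated to the proposal centre and rotated so that the box heading aligns with the x axis, which makes the subsequent refinement easier to learn.

```python
import numpy as np

def to_canonical(points, box_center, box_yaw):
    """Express points in a proposal's canonical frame (centre at the origin,
    heading along +x). points: (N, 3); box_center: (3,); box_yaw: yaw in radians."""
    shifted = points - box_center
    c, s = np.cos(-box_yaw), np.sin(-box_yaw)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return shifted @ rot.T
```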
As mentioned above, in addition to PCM, another idea is to divide the space into a number of voxel units: the points of the original point cloud are assigned to the voxels corresponding to their positions, and the object is then detected using the voxels as input. In [67], Zhou et al. proposed the following idea: as the number of points in each voxel is not necessarily equal, because the point cloud is not evenly distributed, random sampling and normalization of the points are required. Then, for each non-empty voxel, several VFE (voxel feature encoding) layers are used for local feature extraction, after which voxel-wise features are obtained. Next, 3D convolutional middle layers (which enlarge the receptive field and learn representations of the geometric space) are used to abstract the features. Finally, an RPN (region proposal network) is used for object classification and position regression. This framework is also one of the methods most commonly used to extract point cloud features via voxelization.
Voxel-based methods are also developing rapidly. In 2019, in [13], Chen et al. proposed Fast Point R-CNN, a two-stage 3D object detection framework that utilizes both the voxel representation and the original point cloud. More specifically, the first stage uses voxels as the data input and only a few layers of lightweight convolution operations to generate a small number of high-quality initial predictions. An attention mechanism is then used to obtain the information of each predicted box by combining the coordinate information and the extracted feature information. The second stage uses RefinerNet to further optimize and predict from the above-mentioned mixed features. The innovation of this method is as follows: the attention mechanism in the first stage lets each point aggregate its position and context information, while both voxel and raw data are used, because voxels can roughly extract position information and the second stage uses raw data to obtain detailed information, thus fusing the advantages of both representations.
Another approach maps the point cloud onto a two-dimensional plane. A certain amount of information is lost during this mapping because of the higher dimensionality of the point cloud, but the time and space complexity of point cloud based methods is higher than that of 2D image-based methods, so projection reduces the computational burden. At the same time, in the autonomous driving context almost all objects lie in the same plane; thus, to a certain extent, the partial loss of high-level information does not significantly affect the detection results.
Therefore, it is also practically meaningful to use the BEV (bird's eye view) representation of a 3D scene to detect 3D objects. In [65], the framework proposed by Yang et al. obtains channels for height and reflectivity from the point cloud, and RetinaNet [33] is then used to detect and locate objects. This is a one-stage method whose input representation, network structure and model optimization are specifically designed to achieve a balance between high accuracy and real-time performance.
In 2018, in [4], Beltran et al. projected the input point cloud to obtain a three-channel BEV map whose channels encode height, intensity and density, where the density is normalized and preprocessed. Faster R-CNN is then used to perform oriented 2D object detection on the BEV map. Finally, oriented 3D object detection is carried out offline by combining the detection results with ground estimation.
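A minimal numpy sketch of such a projection is given below. It is only an assumption-laden illustration of the height/intensity/density encoding; the grid resolution, ranges and normalisation constant are ours, not the exact settings of [4] or [65].

```python
import numpy as np

def bev_map(points, res=0.1, x_range=(0, 70), y_range=(-35, 35)):
    """Project a point cloud (N, 4: x, y, z, intensity) onto a 3-channel BEV
    image with height, intensity and (log-normalised) density channels."""
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]

    ix = ((pts[:, 0] - x_range[0]) / res).astype(int)
    iy = ((pts[:, 1] - y_range[0]) / res).astype(int)
    H = int((x_range[1] - x_range[0]) / res)
    W = int((y_range[1] - y_range[0]) / res)

    bev = np.zeros((3, H, W), dtype=np.float32)
    np.maximum.at(bev[0], (ix, iy), pts[:, 2])      # max height per cell
    np.maximum.at(bev[1], (ix, iy), pts[:, 3])      # max intensity per cell
    np.add.at(bev[2], (ix, iy), 1.0)                # point count per cell
    bev[2] = np.log1p(bev[2]) / np.log(64.0)        # normalised density
    return bev
```

The resulting (3, H, W) tensor can be treated like an ordinary image and fed to a standard 2D detector such as Faster R-CNN or RetinaNet.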
In 2019, Zhou et al. proposed an original projection method in [66], which differs from the common approaches that generate proposals from camera or bird's-eye-view images: point clouds are projected onto a cylindrical surface in order to generate a forward-looking feature map with richer information. This method also provides a new direction for future research.
In current point cloud object detection, 3D voxel CNNs can generate high-quality proposals, while PointNet-based methods can capture more accurate object location information because of their flexible receptive fields. Voxel-based networks and point-based frameworks have their own advantages, and combining them can yield a better framework. In 2020, Shi et al. adopted this idea and proposed a novel high-performance 3D object detection framework named PointVoxel-RCNN (PV-RCNN) [47]. In this method, a 3D voxel convolutional neural network (CNN) and a PointNet-based method are deeply fused to learn more discriminative point cloud features. Specifically, the original point cloud is first voxelized and fed into an encoder based on 3D sparse convolution to learn multi-scale semantic features and generate 3D object proposals. Then, the voxel features learned at multiple network layers are summarized into a small set of keypoints through a new voxel set abstraction module. Finally, according to the keypoint features aggregated on ROI grid points, proposal-specific features are learned and used for fine-grained proposal refinement and confidence prediction.
We introduced the advantages and disadvantages of monocular image-based methods (MCIM) in Section 1. The primary difference between 2D and 3D object detection is that the latter requires depth information to obtain spatial features.
The key to obtaining depth information is depth estimation, which can sometimes replace a point cloud. Depth estimation can be obtained through either monocular or stereo vision. In Table 2, we present the ablation study of D4LCN [37]. Generally speaking, the error of supervised depth estimation algorithms is smaller than that of unsupervised ones, and the error of multi-view image-based methods (MVIM) is smaller than that of MCIM. From Table 2, we can see that the accuracy of D4LCN with an unsupervised depth estimation method is lower than with a supervised one; meanwhile, the accuracy of D4LCN with a stereo-based method is lower than with a monocular-based one [37]. Therefore, better depth estimation algorithms often yield more accurate detection methods [9, 17, 22]. As another example, DORN [17] can predict pixel depth by combining multi-scale features with ordinal regression; therefore, [55] used DORN for depth estimation.
In [37], the depth map is first estimated from the RGB image and used, together with the RGB image, as the input of a two-branch network. A depth-guided filtering module is then used to fuse the two kinds of information in each residual block. Finally, a one-stage detector with non-maximum suppression (NMS) is used for prediction. Instead of converting the depth map estimated from a two-dimensional image into a pseudo-lidar representation, a new local convolutional network, called the depth-guided dynamic depthwise dilated LCN (D4LCN), is proposed. It can automatically learn convolution kernels and their receptive fields from the image-based depth map, so that different pixels of different images have different filters. D4LCN overcomes the limitation of traditional 2D convolution and narrows the gap between image representation and 3D point cloud representation. A large number of experiments show that D4LCN is superior to existing methods to a great extent.
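The "different pixels get different filters" idea can be caricatured with a toy depth-guided filter, sketched below under our own assumptions. It only shows a depth branch predicting a per-pixel 3x3 kernel that is applied to the image feature neighbourhood; D4LCN's depthwise and dilated design, and its exact layer sizes, are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthGuidedFilter(nn.Module):
    """Toy sketch: the depth branch predicts a 3x3 kernel for every pixel,
    and that kernel is applied to the local neighbourhood of the image feature."""
    def __init__(self, channels):
        super().__init__()
        self.kernel_net = nn.Conv2d(channels, 9, kernel_size=3, padding=1)

    def forward(self, img_feat, depth_feat):          # both (B, C, H, W)
        B, C, H, W = img_feat.shape
        kernels = F.softmax(self.kernel_net(depth_feat), dim=1)   # (B, 9, H, W)
        patches = F.unfold(img_feat, 3, padding=1)                # (B, C*9, H*W)
        patches = patches.view(B, C, 9, H * W)
        out = (patches * kernels.view(B, 1, 9, H * W)).sum(dim=2)
        return out.view(B, C, H, W)

# Usage sketch: fuse an image feature map with a depth feature map of the same size.
img, depth = torch.randn(1, 16, 32, 32), torch.randn(1, 16, 32, 32)
fused = DepthGuidedFilter(16)(img, depth)
```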
Table 2 Ablation study of D4LCN [37]: comparative results on Split 1 of the KITTI validation set. E, M and H stand for Easy, Moderate and Hard, respectively; M stands for monocular image; AP|R11 is the 11-point interpolated average precision metric and AP|R40 is the 40-recall-position metric
In 2019, Li et al. proposed a monocular method named GS3D [31] (see Fig. 3). GS3D consists of the following steps: (1) 2D detection and direction prediction, which extends the Faster R-CNN framework with an additional orientation prediction branch. (2) Guidance generation: based on the 2D detection results, a coarse 3D bounding box (the "guidance") is predicted for each 2D bounding box; more specifically, a relationship formula obtained through analysis is used to generate the 3D bounding box. (3) Surface feature extraction: the given 3D bounding box (i.e., the guidance obtained in the previous step) is projected onto its visible surface areas to extract features determined by the 3D structure. (4) Refinement: the discrete classification proposed by the authors usually performs better than direct large-range regression, so residual regression is converted into a classification task to improve the refinement of the 3D bounding box.
In [36], Mousavian et al. proposed a method for 3D object and pose estimation from a single image. This method first uses a deep convolutional neural network to regress relatively stable 3D object attributes, then combines these estimates with the geometric constraints imposed by the 2D object box to generate a complete 3D bounding box. The first output uses a new hybrid discrete-continuous loss function for the orientation and is shown to be significantly better than an L2 loss. The second output regresses the 3D object dimensions, which vary relatively little compared with the alternatives and can be predicted reliably for many object types. These estimates are combined with the geometric constraints imposed by the two-dimensional bounding box to recover a stable and accurate three-dimensional object pose.
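One way to make the geometric constraint concrete is the following paraphrase of the idea (our notation, not the exact formulation of [36]): with camera intrinsics K, rotation R(θ) from the estimated yaw, estimated dimensions (l, h, w) and unknown translation T, each side of the 2D box must touch the projection of some corner X_i of the 3D box. For the left edge this reads

```latex
x_{\min} \;=\; \frac{\big[\,K\,(R(\theta)\,X_i + T)\,\big]_x}{\big[\,K\,(R(\theta)\,X_i + T)\,\big]_z},
\qquad
X_i \in \Big\{ \tfrac{1}{2}\,(\pm l,\; \pm h,\; \pm w)^{\top} \Big\},
```

and the analogous equations for x_max, y_min and y_max give four constraints on the three unknowns of T, which can be solved by least squares over the candidate corner assignments.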
In [64], Xu et al. proposed a framework for 3D object detection based on end-to-end multi-level fusion from a single image. The framework has two key components: the first generates 2D region proposals, while the second produces the 2D position, orientation, dimensions and 3D position of the object. Xu et al. proposed a multi-level fusion scheme that uses an independent module to estimate disparity and compute a 3D point cloud. First, the disparity information is encoded as a front-view feature representation and fused with the RGB image to enhance the input. Second, the features extracted from the original input are combined with the point cloud to improve object detection. For 3D localization, an additional module is introduced that directly predicts position information from the point cloud and adds it to the above position prediction. The algorithm can output 2D and 3D detection results using only one RGB image as input.
In the field of intelligent robotics, 3D object detection is also a challenging task. If the accuracy of single-view 3D object detection is improved, the development of low-cost mobile robots will be promoted. Previous solutions to this problem were overly complicated and their accuracy was low. To address these problems, Jorgensen et al. proposed the SS3D method in [26].
Fig. 3 The overview of GS3D: Based on the CNN model (2D+O subnet), the two-dimensional bounding
box and observation direction of the object can be obtained; then, the guidance can be generated with the
obtained two-dimensional box and projection matrix; finally, the refinement model utilizes features extracted
from the visible surface and 2D bounding box of the projected guidance to obtain a refined 3D box
Fig. 4 The overview of the method based on RGB and Depth image: first, the latest 2D object detection
technology is utilized for reducing the 3D search space; then, the multi-layer perceptron (MLP) is used to
learn the 3D space information; finally, this three-dimensional information is used to determine the direction,
position and score of the target’s bounding box
stereo images, and 3D box proposals generated by an energy minimization function, which
is encoded via depth and hand-crafted geometric features.
However, some researchers do not believe that the main factor affecting the detection
result is depth estimation.
In 2018, in [55], Wang et al. used stereo image data as input for object detection. In general, compared with PCM, monocular or stereo image data results in greatly reduced accuracy, and this gap is usually attributed to insufficient image-based depth estimation. However, Wang et al. demonstrated that the main factor affecting performance is the representation quality rather than the quality of the depth estimation. Considering the internal structure of convolutional neural networks, the authors propose converting the image-based depth map into a pseudo-LiDAR representation that imitates a LiDAR point cloud [55] for detection purposes. This representation can also be fed to existing LiDAR-based detection methods.
The specific process of the method is as follows. Given a multi-view or monocular image, the depth map is first predicted and then projected into a 3D point cloud in the LiDAR coordinate system; this result is called a pseudo-LiDAR representation. Any algorithm based on LiDAR data can then take this pseudo-LiDAR data as input. Two uses of the representation are proposed in [55]: in the first, the pseudo-LiDAR information is treated as a 3D point cloud; in the second, the authors use the pseudo-LiDAR information from a bird's eye view (BEV), converting the 3D information into a 2D bird's-eye-view map in which width and depth become the spatial dimensions while height is recorded as the channel.
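The core of the pseudo-LiDAR idea is simple back-projection with the camera intrinsics. A hedged numpy sketch follows, assuming a pinhole model without distortion and known intrinsics fu, fv, cu, cv; it is an illustration of the principle, not the authors' code.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fu, fv, cu, cv):
    """Back-project an (H, W) depth map (metres) into an (H*W, 3) point cloud
    in the camera frame: x right, y down, z forward (pinhole model)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth
    x = (u - cu) * z / fu
    y = (v - cv) * z / fv
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]          # keep only points with valid depth
```

The resulting points can then be handed to any LiDAR-based detector, or flattened into a BEV map as described above.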
Another idea exploits dense object constraints for the final 3D bounding box regression. In 2019, Li et al. proposed a 3D object detection method that collects sparse and dense, semantic and geometric information from stereoscopic images [30]. The Stereo R-CNN method proposed by Li et al. is an extension of Faster R-CNN: it expands the input of Faster R-CNN so that targets in the left and right images of the stereoscopic view can be detected and associated simultaneously. After the region proposal network (RPN), additional branches are added to predict sparse keypoints, viewing angles and object dimensions. These three variables are combined with the 2D detection boxes to compute a rough 3D object bounding box. Subsequently, an area-based photometric alignment method is used to obtain an accurate 3D object bounding box. Depth input and 3D position supervision are not required in this method, yet it is superior to existing fully supervised image-based methods. Experiments on the KITTI dataset show that this method outperforms state-of-the-art multi-view image-based methods in terms of both 3D detection and 3D localization. The main contributions of [30] are as follows: 1. The Stereo R-CNN method is proposed, which can simultaneously detect and associate objects in a stereo image pair. 2. A rough 3D box generator is designed that uses keypoints and three-dimensional box constraints. 3. An alignment method based on photometric measurement of dense areas is proposed, which ensures the localization accuracy of the three-dimensional objects obtained.
In 2020, Sun et al. proposed a three-stage method, Disp R-CNN [54]. First, a stereo Mask R-CNN detects the 2D bounding boxes and instance segmentation masks in the input images. Then, the cropped ROI images are used as the input of iDispNet to estimate the instance-level disparity map. Finally, the instance disparity map is transformed into an instance point cloud and sent to a 3D detector for 3D bounding box regression. The main contributions of this paper are as follows: 1. A new 3D object detection framework based on instance-level disparity estimation is proposed. 2. A pseudo ground truth generation process provides supervision for instance disparity estimation and guides it to learn object shape priors in advance, which is beneficial to 3D object detection.
We introduced the methods based on the combination of point cloud and image in Section 2.4. These combined methods can be divided into two primary categories (see Fig. 5): feature fusion-based methods and 2D-driven 3D object detection methods.
We first introduce the feature fusion-based methods. As shown in Fig. 6, there are three types of fusion: early fusion, late fusion and deep fusion. Generally speaking, the input data are the image, the front view and the BEV. As the names suggest, early fusion combines the various types of data before extracting features, while late fusion first extracts the features of each modality and then fuses them [53]. Deep fusion first computes the element-wise mean of the different types of data, then repeatedly takes element-wise means of the resulting fusion features. These three schemes are the main approaches to multimodal feature extraction.
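The three fusion schedules can be caricatured as in the sketch below. The feature extractors and detection head are placeholders for real backbones, and deep fusion is shown simply as repeated element-wise means in line with the description above; this is a hedged illustration, not any published architecture.

```python
import torch
import torch.nn as nn

class DeepFusionBlock(nn.Module):
    """One round of deep fusion: mix the modalities by element-wise mean,
    then let each branch transform the mixed feature again."""
    def __init__(self, dim):
        super().__init__()
        self.branch_a = nn.Linear(dim, dim)
        self.branch_b = nn.Linear(dim, dim)

    def forward(self, feat_a, feat_b):
        mixed = (feat_a + feat_b) / 2                 # element-wise mean
        return torch.relu(self.branch_a(mixed)), torch.relu(self.branch_b(mixed))

def early_fusion(raw_a, raw_b, extractor):
    # Concatenate the raw inputs first, then extract features once.
    return extractor(torch.cat([raw_a, raw_b], dim=-1))

def late_fusion(raw_a, raw_b, extractor_a, extractor_b, head):
    # Extract per-modality features first, fuse only at the end.
    return head(torch.cat([extractor_a(raw_a), extractor_b(raw_b)], dim=-1))
```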
We will introduce a few typical methods below.
In 2016, in [14], a 3D object detection network with multi-modal data was proposed by Chen et al. The multi-modal input data include RGB images, the lidar bird's eye view and the lidar front view. The main procedure can be divided into the following steps: 1) extract features (a. point cloud top-view features; b. point cloud front-view features; c. image features); 2) compute candidate regions from the point cloud top-view features; 3) integrate the candidate regions with the features obtained in 1) a, b and c respectively: a. project the top-view candidate regions into the front view and the image; b.
Fig. 5 The overview of combination methods: these combined methods can be divided into two primary categories, feature fusion-based methods and 2D-driven 3D object detection methods. Feature fusion-based methods are illustrated in Fig. 6. 2D-driven 3D object detection methods can be described as follows: first, the target is detected in the 2D image, after which 3D detection is carried out in the 3D data according to the detection result
Fig. 6 Fusion methods: there are three types of fusion, early fusion, late fusion and deep fusion. Early fusion combines the various types of data before extracting features; late fusion first extracts the features of each modality and then fuses them; deep fusion first computes the element-wise mean of the different types of data, then repeatedly takes element-wise means of the resulting fusion features
integrate them into the same dimension through ROI pooling; 4) fuse the integrated data through the network.
In 2017, Xu et al. proposed a fusion method called PointFusion [63]. First, PointNet [41] is used to extract global and point-wise features of the original point cloud, while ResNet [25] extracts the image features. The authors then propose two fusion schemes, global fusion and dense fusion. Global fusion only fuses the global features of the point cloud with the image features and then generates a 3D bounding box; dense fusion fuses the global features with the point-wise features and the image features before generating a 3D bounding box. The final experiments show that the detection results of dense fusion are better than those of global fusion.
Based on VoxelNet [67], Sindagi et al. proposed a method named MVX-Net [51], which aims to fuse information from RGB images and point cloud data. In [51], the authors propose both early and late fusion variants. First, a network pretrained on ImageNet [28] is fine-tuned for the 2D detection task, and features are extracted from the last convolution layer of the 2D detection network. The high-level image features obtained in this way encode prior semantic information used to judge the existence of objects. Next, based on point fusion or voxel fusion, each point or voxel is projected onto the image, and the concatenated point or voxel features are used to obtain the corresponding fused features. Finally, the detection results are obtained by the voxel encoding layers and a 3D RPN (3D region proposal network).
There are two reasons why MVX-Net chooses VoxelNet [67] as its base 3D detection network: (1) the input of VoxelNet is the original point cloud and does not require handcrafted features, which suits the deep learning approach; (2) it provides a natural and effective interface for combining image features with 3D features.
In 2020, Pang et al. proposed a late fusion method [38]. The advantage of late fusion is that the network structures of the two modalities do not interfere with each other and can be trained separately and then combined; however, a disadvantage is that fusion at the decision level retains the least information from the original data. Since each modality produces its own proposals and there is no inherent relation between the confidence scores of proposals from the two modalities, one of the problems to be solved is how to relate the confidence scores of proposals generated by the different modalities. Pang et al. adopt geometric consistency and semantic consistency for this purpose [38]. The framework of CLOCs has three main stages: (1) proposals are generated by 2D and 3D detectors respectively; (2) the proposals of the two modalities are encoded as sparse tensors; (3) for the non-empty elements, two-dimensional convolution is used for feature fusion.
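A hedged sketch of the pairing step is given below. Here geometric consistency is measured simply as the 2D IoU between each 2D detection and the image projection of each 3D detection, and the result is a dense tensor in which only the non-empty entries matter; the exact sparse encoding and features used in [38] differ.

```python
import numpy as np

def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def build_fusion_tensor(dets_2d, dets_3d_projected):
    """dets_2d: list of (box, score); dets_3d_projected: list of (box, score),
    where the 3D detections have already been projected into the image.
    Returns an (N2, N3, 3) tensor of [IoU, 2D score, 3D score] that a small
    2D convolutional head could consume."""
    n2, n3 = len(dets_2d), len(dets_3d_projected)
    fused = np.zeros((n2, n3, 3), dtype=np.float32)
    for i, (box2, s2) in enumerate(dets_2d):
        for j, (box3, s3) in enumerate(dets_3d_projected):
            iou = iou_2d(box2, box3)
            if iou > 0:                       # geometric consistency
                fused[i, j] = (iou, s2, s3)
    return fused
```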
In 2019, Wang et al. proposed a multi-sensor 3D object detection method, MCF3D, based on multi-level complementary fusion [60]. MCF3D is an end-to-end network consisting of a 3D region proposal subnet (3D RPN) and a second-stage detector subnet. First, 2D images and LiDAR point clouds are taken as input, and CNNs are used to extract high-resolution feature maps from the images and BEVs. Next, 3D anchors are generated from the BEV, and the features of anchors from different perspectives are fused to generate three-dimensional, non-oriented region proposals. Finally, the features related to the proposals are fused and delivered to the detection subnet for size refinement, orientation estimation and classification. Most importantly, a series of fusion modules is designed for each stage of the task: (1) "pre-fusion" is used in the input data processing stage to selectively match and transform the camera and lidar data with spatial geometric constraints and prior knowledge; (2) "anchor-fusion" is used in the RPN stage to fully integrate the anchor-guided features from different views and obtain a high recall in the detection step; (3) "proposal-fusion" is used in the detection stage, where a new attention model acts as a feature selector to refine the fusion by strengthening useful features and suppressing invalid ones.
The above methods are based on the fusion of point cloud and image features (FPCIM); the other algorithmic idea is 2D-driven 3D object detection. In brief, the target is first detected in the 2D image, after which 3D detection is carried out in the 3D data according to the detection result [39, 58]. The advantage of this approach is that it reduces the scope of 3D detection, which improves both efficiency and accuracy.
In 2017, [29] proposed a 2D-driven 3D object detection method. Using features based on a histogram of point coordinates with simple fully connected networks for 3D box localization and pose regression may be imperfect in terms of both speed and performance; however, it also provides a new direction for researchers.
In 2018, inspired by [29], Qi et al. proposed a 3D object detection method referred to as Frustum PointNets [39], which is applicable to both indoor and outdoor scenes. This method combines 2D object detection and deep learning on point clouds to locate 3D objects. By processing the point cloud directly, it can still estimate the 3D bounding box effectively even when the point cloud is strongly occluded or sparse. First, a 2D detector is used to detect region proposals of the object, after which the 3D frustum region corresponding to each 2D region proposal is obtained through the camera parameters. Next, the object is segmented within the frustum region, and the 3D bounding box is predicted from the segmentation result. The biggest advantage of this method is that it not only reduces the amount of computation required but also improves accuracy.
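A hedged sketch of how a 2D box carves a frustum out of the point cloud follows: every point is projected into the image with an assumed 3x4 projection matrix P (from the point cloud frame to pixel coordinates), and only the points falling inside the box are kept. This illustrates the geometric idea only, not the authors' implementation.

```python
import numpy as np

def points_in_frustum(points, box_2d, P):
    """points: (N, 3) in the frame that P projects from; box_2d: (x1, y1, x2, y2);
    P: 3x4 projection matrix from that frame to pixel coordinates."""
    homo = np.hstack([points, np.ones((len(points), 1))])       # (N, 4)
    proj = homo @ P.T                                            # (N, 3)
    z = proj[:, 2]
    u, v = proj[:, 0] / z, proj[:, 1] / z
    x1, y1, x2, y2 = box_2d
    mask = (z > 0) & (u >= x1) & (u <= x2) & (v >= y1) & (v <= y2)
    return points[mask]
```

The points returned by such a function are what the subsequent segmentation and box regression networks operate on.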
Notably, the pioneering work of F-PointNet [39] can successfully estimate 3D bounding boxes. However, F-PointNet has two key imperfections: 1) it is not an end-to-end means of estimating the oriented box; 2) as the final estimate may depend on too few foreground points, the object may be segmented incorrectly.
To overcome these limitations, in 2019, Wang et al. proposed a new 3D object detection method called Frustum ConvNet (F-ConvNet) [58]. Similar to F-PointNet [39], this method assumes that region proposals can be obtained easily from existing object detectors and recognizes the 3D objects corresponding to the pixels in each proposal. Given a series of frustums generated from a region proposal, the most important design aspect of F-ConvNet is that it aggregates point-wise features into frustum-level feature vectors. These feature vectors are then fed, as a 2D feature map, into a fully convolutional network (FCN) with a detection header for end-to-end, continuous estimation of the oriented 3D box within each frustum. This method achieves excellent performance on the indoor SUN RGB-D [52] and outdoor KITTI [19] datasets.
4 Evaluation
In this section, we briefly evaluate and discuss the performance of the above-mentioned work. Before analyzing the performance of these methods, we first introduce the datasets and metrics of 3D object detection. Then we compare their performance on the same dataset using different metrics. Finally, we discuss the existing methods and future trends.
4.1 Datasets
In this survey, recent 3D object detection research is divided into four categories according to the input data type, but these methods can also be divided into outdoor-scene and indoor-scene methods according to their applicable scenes. Therefore, we first introduce the most commonly used datasets for indoor and outdoor scenes.
Most indoor-scene datasets are collected indoors, for example in living rooms, bedrooms, offices and meeting rooms. Common indoor-scene datasets include SUN RGB-D [52], ScanNet [15], ModelNet40 [62], etc. We focus on the datasets most frequently used in recent years.
Previously, RGB-D benchmark suites such as NYU Depth V2 [50] were an order of magnitude smaller than modern benchmark suites for color images (such as Pascal VOC [16]). Although these small datasets led to initial progress in RGB-D scene understanding, their limited size has become a key bottleneck that prevents research from going deeper. Small datasets not only easily lead to overfitting but also cannot support current data-hungry algorithms for color-based recognition (such as [20]). Therefore, Song et al. proposed an RGB-D benchmark suite to improve the technical level of all main scene understanding tasks [52]. The SUN RGB-D dataset is captured by four different sensors and contains 10,335 RGB-D images, a scale similar to Pascal VOC [16]. It is a single-view dataset showing indoor scenes. The entire dataset is densely annotated, including 146,617 two-dimensional polygons and 64,595 three-dimensional bounding boxes with precise object orientation, as well as the three-dimensional room layout and scene category of each image, with 37 object categories in total (of which the 10 most common categories are usually used). The training split contains 5,285 RGB-D images. If a point cloud is to be used as the model input, the depth images must first be converted into point clouds.
In [15], Dai et al. proposed ScanNet, an RGB-D video dataset containing 1,513 scenes annotated with 3D camera poses, surface reconstructions and semantic segmentations. To collect these data, Dai et al. designed an easy-to-use and scalable RGB-D capture system that includes automatic surface reconstruction and crowdsourced semantic annotation. ScanNetV2 is an RGB-D video dataset that can be used for semantic segmentation and object detection tasks. It contains 1,513 collected scenes and 21 object categories, of which 1,201 scenes are used for training and 312 scenes for testing.
Matterport3D [10] is a large-scale RGB-D dataset. It contains 10,800 aligned 3D panoramic views (RGB plus per-pixel depth) and 194,400 RGB and depth images from 90 building-scale scenes.
Outdoor scenes include streets, rural areas, highways, etc. Nowadays, owing to the development of automatic driving technology, outdoor-scene datasets are widely used, especially driving-scenario datasets such as KITTI.
The KITTI dataset was jointly created by the Karlsruhe Institute of Technology in Germany and the Toyota Technological Institute at Chicago [19]. It is currently the world's largest computer vision benchmark for autonomous driving scenarios and is used to evaluate the performance of computer vision technologies such as stereo, optical flow, visual odometry, 3D object detection and 3D tracking in vehicle environments. KITTI contains real image data collected from scenes such as urban areas, rural areas and highways. Each image can contain up to 15 cars and 30 pedestrians with various degrees of occlusion and truncation. The entire dataset is composed of 389 pairs of stereo images and optical flow maps, a 39.2 km visual odometry sequence and more than 200k 3D-annotated object images, sampled and synchronized at a frequency of 10 Hz. In general, the raw data is classified into 'Road', 'City', 'Residential', 'Campus' and 'Person'. For 3D object detection, the labels are subdivided into car, van, truck, pedestrian, pedestrian (sitting), cyclist, tram and misc.
In particular, the object detection dataset has 7,481 training frames and 7,518 test frames, which are provided with sensor calibration information and annotated 3D boxes around the objects of interest. According to object size, occlusion and truncation levels, the annotations are divided into 'easy', 'moderate' and 'hard' cases.
The nuScenes dataset consists of 1,000 scenes [6], each of which is 20 seconds long and covers a variety of scenarios. Each scene contains 40 key frames, that is, 2 key frames per second, and the other frames are sweeps. Key frames are manually labeled, and each frame carries several annotations in the form of bounding boxes, including not only size and extent but also category, visibility and so on.
Waymo contains 3,000 driving records totalling 16.7 hours, with an average length of about 20 seconds; it comprises 600,000 frames with about 25 million 3D bounding boxes and 22 million 2D bounding boxes across a variety of autonomous driving scenes [61].
PandaSet (https://scale.com/open-datasets/pandaset) is an autonomous driving dataset collected in San Francisco. It contains 48,000 camera images, 16,000 lidar scans and more than 100 scenes, each 8 seconds long, with 28 annotation classes and 37 semantic segmentation labels. The dataset annotates 28 kinds of objects in each of its 100+ scenes, and most scenes are also labeled with 37 kinds of semantic tags. For example, bicycles and cars are framed with cuboids. For lidar point cloud data, not every point belongs to a particular target, so the semantic label of each point is accurately labeled with a point cloud segmentation tool. Such fine annotation provides excellent data for deep learning models.
4.2 Metrics
The KITTI dataset provides a consistent baseline for comparison [19], which includes the following metrics: average precision of the 2D bounding box (AP2D), average precision of the 3D bounding box (AP3D), average precision in the bird's eye view (APBEV) and average orientation similarity (AOS). AP3D and AP2D are also the most common metrics on other datasets.
(1) AP2D: uses the AP computation of a 2D detector; the 2D projection of the detected 3D bounding box in the world coordinate system is mapped to the image coordinate system, and the AP value is computed through IoU.
(2) AP3D: the IoU between the detection results and the ground truth is computed directly in 3D space, and the AP value is computed through this IoU.
(3) APBEV: the 3D detection results and the ground truth are mapped to the 2D bird's eye view, and the AP value is computed through IoU. If the overlap area between an estimated box and a ground truth box exceeds a certain threshold, the sample is considered a true positive.
(4) AOS: average orientation similarity uses the cosine similarity between the estimated azimuth and the ground truth azimuth to weight the AP2D score, jointly measuring two-dimensional detection and orientation estimation performance.
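The AP|R11 metric mentioned in Table 2 can be sketched as a generic 11-point interpolation over a precision/recall curve, as below. This is a hedged illustration only; KITTI's official evaluation additionally handles difficulty filtering and "don't care" regions, which is omitted here, and the toy recall/precision values are made up.

```python
import numpy as np

def ap_11_point(recall, precision):
    """11-point interpolated AP: average, over recall thresholds 0.0..1.0,
    of the maximum precision achieved at recall >= threshold."""
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 11):
        mask = recall >= t
        p = precision[mask].max() if mask.any() else 0.0
        ap += p / 11.0
    return ap

# Toy usage: a detector whose precision decays as recall grows.
recall = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
precision = np.array([1.0, 0.9, 0.8, 0.6, 0.4])
print(ap_11_point(recall, precision))
```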
4.3 Evaluation
Since most of the methods introduced in this survey are evaluated on the KITTI dataset [19], we first discuss the methods evaluated on KITTI.
Because machine configurations differ, the reported running times differ. To keep the equipment configurations as comparable as possible, we use the runtimes published on the official website as the times in Table 3. Table 3 presents 3D metrics on the KITTI test set for the car class with a 0.7 3D IoU threshold and for the pedestrian and cyclist classes with a 0.5 3D IoU threshold. These methods include point cloud-based methods, multi-view-based methods and methods that combine point cloud and image.
For the same method, the accuracy on easy-level data is higher than on hard-level data, because the features of easy-level data are more obvious and easier to extract, while occluded or blurred hard-level data make individual features less distinct, so the accuracy of the feature extraction network is lower, which is in line with expectations. At present, most studies achieve a high detection accuracy for easy-level data, but the detection accuracy for hard-level data is much lower.
Table 3 Comparison of 3D object detection methods on the KITTI test set (AP3D) for car, pedestrian and cyclist

Method | Modality | Time (s) | Car E/M/H | Pedestrian E/M/H | Cyclist E/M/H
PointRCNN [49] | LiDAR | 0.1 | 85.94/75.76/68.32 | 49.43/41.78/38.63 | 73.93/59.60/53.59
VoxelNet [67] | LiDAR | 0.22 | 77.47/65.11/57.73 | 39.48/33.69/31.51 | 61.22/48.36/44.37
Fast PointRCNN [13] | LiDAR | 0.06 | 84.28/75.73/67.39 | -/42.90/- | -/59.36/-
BirdNet [4] | LiDAR | 0.11 | 14.75/13.44/12.04 | 14.31/11.80/10.55 | 18.35/12.43/11.88
PIXOR [65] | LiDAR | 0.1 | 81.70/77.05/72.95 | -/-/- | -/-/-
FVNet [66] | LiDAR | - | 65.43/57.34/51.85 | 42.01/34.02/28.43 | 38.03/24.58/22.10
PV-RCNN [47] | LiDAR | - | 90.25/81.43/76.82 | 52.17/43.29/40.29 | 78.60/63.71/57.65
GS3D [31] | Mono | 2 | 4.47/2.90/2.47 | -/-/- | -/-/-
D4LCN [37] | Mono | 0.2 | 16.65/11.72/9.51 | -/-/- | -/-/-
Pseudo-LiDAR [55] | Stereo | 0.4 | 39.70/26.70/22.30 | 29.80/22.10/18.80 | 3.70/2.80/2.10
Stereo R-CNN [11] | Stereo | 0.3 | 49.23/34.05/28.39 | -/-/- | -/-/-
Disp R-CNN [54] | Stereo | - | 59.58/39.34/31.99 | -/-/- | -/-/-
MVX-Net [51] | LiDAR+Mono | - | 83.20/72.70/65.20 | -/-/- | -/-/-
F-PointNet [39] | LiDAR+Mono | 0.17 | 81.20/70.39/62.19 | 51.21/44.89/40.23 | 71.96/56.77/50.39
F-ConvNet [58] | LiDAR+Mono | 0.47 | 85.88/76.51/68.08 | 52.37/45.61/41.49 | 79.58/64.68/57.03

Time is the runtime in seconds; E, M and H stand for Easy, Moderate and Hard, respectively; Mono stands for monocular image.
Therefore, improving the detection accuracy on hard-level data will become a future research trend, because many objects in real life are dynamic and the collected data rarely consists of easy cases. Similarly, the size of the target has a great influence on the detection result, for reasons similar to the easy-level versus hard-level distinction. When real-world coordinates are transformed into coordinates on the computer, information is lost; for example, when 3D data of a car is collected from different angles, some object information is inevitably lost due to equipment limitations. In practical applications this loss may have a relatively large impact, especially in the field of autonomous driving, which has extremely high safety requirements. Therefore, how to reduce this information loss is also a very important direction for improvement.
Comparing different methods, we can see that PCM yield a greater improvement than stereo vision methods. Among them, PointRCNN is a two-stage method [49], and the second-stage refinement raises its accuracy. However, its pedestrian detection accuracy is not as high as that of F-PointNet, which is based on point clouds and images. This can be explained by the fact that pedestrians are very small, so the image can capture more pedestrian details than the point cloud to help 3D detection.
Also a two-stage detection method, Fast PointRCNN takes voxelized point clouds as input; since its backbone network is composed of two-dimensional and three-dimensional convolutions [13], it achieves efficiency comparable to PIXOR [65] and even higher performance than VoxelNet [67]. Compared with PointRCNN, which directly takes the point cloud as input, Fast PointRCNN uses RefinerNet to further optimize the results and compensate for the information lost during voxelization of the point cloud. Therefore, in the car category, the accuracy of Fast PointRCNN is close to, and in some settings even higher than, that of PointRCNN.
BirdNet and FVNet both project the point cloud onto a two-dimensional grid [4, 66]. Most existing methods of this kind are based on the bird's eye view, whereas FVNet is based on the front view. Since FVNet projects the point cloud onto a cylindrical surface to generate a forward-looking feature map that retains rich information, the loss of point cloud information is small, and its accuracy is much higher than that of BirdNet, which is based on a bird's eye view. However, it is still lower than methods that take the point cloud directly as input, because projecting the point cloud onto a plane always loses some information.
PointNet is a typical method that takes the original point cloud as input and has been applied to single object classification and semantic segmentation [40]. However, in [41], only targets in small indoor scenes are detected, and such indoor-focused methods can be hard to apply to the sparse, large-scale point clouds obtained from LiDAR scans. Based on PointNet, F-PointNet uses the point cloud and the image as input for 3D object detection [39]. Intuitively speaking, it uses the image detection results to reduce the search range in the point cloud, so as to achieve good results on large-scale point clouds. Since F-PointNet is one-stage and only uses the 2D region proposals from the image detection results, it does not integrate image information into the detection in the point cloud; therefore, its detection accuracy is slightly inferior to that of the two-stage PointRCNN and of MVX-Net, which integrates multi-modal information [51].
Frustum ConvNet improves on F-PointNet [58]. It uses the obtained viewing frustums to group local points and aggregates point-wise features into frustum-level feature vectors, so the point cloud feature extraction is more detailed; however, the network is more complex, so compared with F-PointNet the detection accuracy is higher but the required time is longer. Finally, the detection accuracy of methods that use multi-view images as input is much lower than that of methods that use point cloud input, because images cannot represent the rich spatial information of objects as well as point clouds can.
In [55], Wang et al. first estimated pseudo-LiDAR data and then used F-PointNet for detection. Although the accuracy is low, it is a new idea for cost reduction, because collecting data with LiDAR is currently very expensive.
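The core of the pseudo-LiDAR idea is a simple back-projection: each pixel with an estimated depth is lifted into a 3D point using the camera intrinsics, x = (u - cx)·z/fx, y = (v - cy)·z/fy, z = depth. The sketch below uses a random array in place of a real depth estimator and made-up intrinsics; it only illustrates the lifting step, after which any PCM can be applied to the resulting cloud.

import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    # Lift an (H, W) depth map to an (H*W, 3) point cloud in the camera frame.
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# A real pipeline would obtain `depth` from a stereo or monocular depth estimator.
depth = np.random.uniform(1.0, 70.0, size=(375, 1242))
cloud = depth_to_pseudo_lidar(depth, fx=721.5, fy=721.5, cx=621.0, cy=187.5)
print(cloud.shape)  # one pseudo point per pixel

The detection quality of such a pipeline is bounded by the quality of the estimated depth, which explains the accuracy gap noted above.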
For the stereo-based Disp R-CNN method [54], the accuracy is significantly higher than that of Pseudo-LiDAR [55]. This is because Disp R-CNN has a pseudo ground-truth generation process, which provides supervision for the instance disparity estimation network and guides it to learn the object shape in advance, which is beneficial to 3D object detection.
In Table 3, PV-RCNN has the best performance among these methods [47], because PV-RCNN not only uses voxel-based CNNs to generate high-quality proposals, but also uses a PointNet-based method to capture more accurate object location information. As a method based on monocular images, D4LCN proposes a new local convolutional network, the depth-guided dynamic dilated LCN. Its accuracy and speed are significantly better than those of GS3D [31]. The main reason is that D4LCN overcomes the limitations of traditional 2D convolution and narrows the gap between the image representation and the 3D point cloud representation. Since the methods based on both the point cloud and the image need to detect on the image first and then detect objects in the point cloud, they take a long time. Therefore, to obtain results faster, it is necessary to reduce redundant operations as much as possible and to avoid using overly deep networks for feature extraction.
Since the detection accuracy of the MCIM is much lower than that of the LiDAR-input methods above, we compare these methods separately. At the same time, because MV3D and PointFusion are compared on the KITTI evaluation set, for convenience we compare all of them on the same KITTI evaluation set at IoU thresholds of 0.5 and 0.7, respectively. See Table 4.
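For reference, the IoU thresholds used in this comparison measure the overlap between a predicted and a ground-truth 3D box. The sketch below handles only axis-aligned boxes (the official KITTI evaluation uses rotated boxes), so it is purely illustrative of the criterion; the box values are made up.

import numpy as np

def axis_aligned_iou_3d(box_a, box_b):
    # Boxes are (xmin, ymin, zmin, xmax, ymax, zmax).
    a, b = np.asarray(box_a, dtype=float), np.asarray(box_b, dtype=float)
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))   # overlap volume
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)

pred = (0.0, 0.0, 0.0, 3.9, 1.6, 1.5)   # roughly car-sized boxes
gt   = (0.5, 0.1, 0.0, 4.4, 1.7, 1.5)
print(round(axis_aligned_iou_3d(pred, gt), 3))  # about 0.69: a hit at IoU 0.5 but a miss at 0.7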
When the IoU threshold is 0.5, MonoFusion has the highest accuracy among these MCIM [64], and even exceeds 3DOP, whose input is stereo imagery. This is because MonoFusion uses an independent disparity estimation and 3D point cloud computing module: the disparity information is encoded into a front-view feature representation and fused with the RGB image to enhance the input, and the original input features are combined with the point cloud features, which not only improves the efficiency of object detection but also extracts the three-dimensional information in the image well. Meanwhile, GS3D only uses the visual features of the picture and does not deeply incorporate 3D information [31], while Deep3DBox uses two deep convolutional neural networks to regress relatively stable 3D object attributes (orientation, length, width and height) and then combines these attributes with the geometric constraints of the 2D bounding box to generate a complete 3D bounding box, without adding visual features. Therefore, its accuracy is lower than that of the previous two.
When the IoU threshold is 0.7, the accuracy of SS3D is slightly higher than that of the other methods [26]. This is because SS3D uses six task nets to regress the class score, 2D bounding box, distance, orientation, dimensions and 3D corners, which makes the feature extraction more refined, so its detection accuracy is higher than that of the other methods.
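To illustrate what several task nets on a shared feature look like in practice, here is a minimal PyTorch sketch with assumed feature and output sizes; it mirrors the idea of regressing the class score, 2D box, distance, orientation, dimensions and 3D corners from one shared representation, but it is not the SS3D architecture itself.

import torch
import torch.nn as nn

class MultiHead3D(nn.Module):
    # One shared per-object feature, several light-weight regression heads.
    def __init__(self, feat_dim=256, num_classes=3):
        super().__init__()
        def head(out_dim):
            return nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))
        self.cls_score   = head(num_classes)  # class score
        self.box_2d      = head(4)            # 2D bounding box
        self.distance    = head(1)            # distance to the object
        self.orientation = head(2)            # e.g. sin/cos of the yaw angle
        self.dimensions  = head(3)            # height, width, length
        self.corners_3d  = head(24)           # 8 corners x 3 coordinates

    def forward(self, feat):
        return {name: module(feat) for name, module in self.named_children()}

feat = torch.randn(4, 256)                    # 4 object features from some backbone
outputs = MultiHead3D()(feat)
print({k: tuple(v.shape) for k, v in outputs.items()})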
According to the comparison results, we can see that the methods whose input includes a point cloud are far more accurate than the methods whose input consists of images only. Among them, MV3D projects the point cloud data onto the bird's-eye view and the front view, extracts features separately, and merges them with the features of the image [14], so some point cloud information is lost in the projection before the final 3D detection. PointFusion directly uses PointNet as the feature extraction network on the raw point cloud [63], so it loses less spatial information than MV3D. Therefore, the final detection accuracy of MV3D is not as good as that of PointFusion. From the reported times, it can be seen that the MCIM consume more time than the MVIM; this is because the stereo images involve more information to process, so they take longer.
In the table, M stands for modality and T stands for time; E, M and H stand for Easy, Moderate and Hard, respectively; in the modality column, M stands for monocular image and L stands for LiDAR.
5.1 Existing problems and future directions
In this section, we introduce the existing problems and future development directions of 3D object detection and put forward our views and suggestions. As discussed in this article, there are many methods for 3D object detection, and we have only introduced a few representative ones. However, current 3D object detection methods are being updated faster and faster, and most of them are applied in the field of autonomous driving. Therefore, based on our study and analysis of the methods of recent years, we raise some problems and opinions concerning the application of 3D object detection to autonomous driving.
(1) Most of the existing methods only perform multi-modal fusion of point cloud data and image data. However, autonomous vehicles carry many sensors. If other modalities, such as sound, were also fused for object detection, the detection accuracy could be further improved.
(2) Reduce the cost of data collection. Collecting point cloud data with existing LiDAR sensors is relatively expensive, while collecting images with cameras is very cheap. If images could be used for 3D object detection with accuracy close to that of the PCM, it would both guarantee detection accuracy and reduce cost, which would be very meaningful. Specifically, existing stereo vision uses two images, and multiple images could likewise be used for 3D detection; one could also use a method similar to pseudo-LiDAR, reconstructing a 3D point cloud from multiple sets of images and then detecting objects in it [42, 55].
(3) Change the detection scheme according to the scene. For example, there is no need to detect pedestrians on highways where no pedestrians are present, while more sophisticated detection schemes can be used on complex streets.
(4) Real-time performance needs to be improved. According to our analysis, higher detection accuracy generally requires more time. In practical applications, the real-time performance of the algorithm is also a very important evaluation criterion, so how to balance accuracy against real-time performance is very important.
(5) Although the existing KITTI dataset contains a large amount of data, most of the scenes are street scenes and are relatively uniform, which is far from enough for simulating real driving. The data for autonomous driving should be diversified, for example collected in different weather, under different lighting and in different regions. Given such differences, one approach is to find the properties of the target that are common across situations, so that a method can be applied to all kinds of conditions; another idea is to adopt different strategies for different situations, as Wang et al. [56] have proposed. For the problem of insufficient data, one way is to introduce unsupervised or weakly supervised methods to alleviate it [1].
Method           Class avg  Instance avg  Airplane  Bag   Cap   Car   Chair  Earphone  Guitar  Knife  Lamp  Laptop  Motorbike
PointNet [40]    80.4       83.7          83.4      78.7  82.5  74.9  89.6   73.0      91.5    85.9   80.8  95.3    65.2
PointNet++ [41]  81.9       85.1          82.4      79.0  87.7  77.3  90.8   71.8      91.0    85.9   83.7  95.3    71.6
Splatnet [53]    83.7       85.4          83.2      84.3  89.1  80.3  90.7   75.5      92.1    87.1   83.9  96.3    75.6
Class avg represents the average accuracy over all classes; instance avg represents the average accuracy over all instances
(6) With the rapid development of deep learning, 3D object detection can be combined with other directions, such as graph convolutional networks (GCN), scene graphs or object tracking. In recent years, there has been a lot of research combining 3D object detection with these directions [12, 32, 48].
5.2 Conclusion
In this paper, we introduced the background of 3D object detection in an accessible way. In the main part, we classified the existing computer-vision-based 3D object detection methods in detail and selected the most classic methods in each category for introduction, including image-based methods, point-cloud-based methods, and methods that fuse point clouds and images; we then compared and summarized these methods. Finally, we also put forward the existing technical difficulties and shortcomings, together with some of our own views and future trends.
Today's 2D object detection technology has reached a stage where further breakthroughs are difficult. However, because of the limited computation capacity of machines and the high cost of data acquisition, the development of 3D object detection is still in its initial stage. In the future, the popularization of intelligent robots and autonomous driving will be inseparable from 3D object detection, and 3D object detection will be a very popular research direction.
Acknowledgements This research was supported in part by the National Natural Science Foundation of
China under grant agreements Nos. 61973250, 61802306, 61973249, 61702415, 61902318, 61876145. Sci-
entific research plan for servicing local area of Shaanxi province education department: 19JC038, 19JC041.
Key Research and Development Program of Shaanxi (Nos.2019GY-012, 2021GY-077).
References
1. Ahmed M (2020) Density based clustering for 3d object detection in point clouds. In: Conference on
computer vision and pattern recognition
2. Bay H, Tuytelaars T, Van Gool LJ (2006) Surf: Speeded up robust features
3. Belongie S (2014) microsoft coco: Common objects in context
4. Beltran J, Guindel C, Moreno FM, Cruzado D, Garcia F, De La Escalera A (2018) Birdnet: a 3d object
detection framework from lidar information
5. Li B, Yan J, Wu W, Zhu Z, Hu X (2018) High performance visual tracking with siamese region
proposal network. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
6. Caesar H, Bankiti V, Lang AH, Vora S, Beijbom O (2020) nuscenes: A multimodal dataset for
autonomous driving. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR)
7. Cai Z, Vasconcelos N (2017) Cascade r-cnn: Delving into high quality object detection
8. Chang J, Chen Y (2018) Pyramid stereo matching network. In: 2018 IEEE/CVF Conference on computer
vision and pattern recognition, pp 5410–5418
9. Chang J, Chen Y (2018) Pyramid stereo matching network
10. Chang A, Dai A, Funkhouser T, Halber M, Niessner M, Savva M, Song S, Zeng A, Yinda Z (2017)
Matterport3D: Learning from RGB-D data in indoor environments. International Conference on 3D
Vision (3DV)
11. Chen X, Kundu K, Zhu Y, Ma H, Fidler S (2018) 3d object proposals using stereo imagery for accurate
object class detection. IEEE Trans Pattern Anal Mach Intell
12. Chen J, Lei Bn, Song Q, Ying H, Chen D, Wu J (2020) A hierarchical graph network for 3d object
detection on point clouds. CVPR: 389–398
13. Chen Y, Liu S, Shen X, Jia J (2019) Fast point r-cnn
14. Chen X, Ma H, Wan J, Li B, Xia T (2016) Multi-view 3d object detection network for autonomous
driving
15. Dai A, Chang AX, Savva M, Halber M, Funkhouser T, Nießner M (2017) Scannet: Richly-annotated 3d
reconstructions of indoor scenes. In: 2017 IEEE Conference on computer vision and pattern recognition
(CVPR), pp 2432–2443
16. Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A (2010) The pascal visual object classes
(voc) challenge. Int J Comput Vis 88(2):303–338
17. Fu H, Gong M, Wang C, Batmanghelich K, Tao D (2018) Deep ordinal regression network for monocular
depth estimation
18. Fu CY, Liu W, Ranga A, Tyagi A, Berg AC (2017) Dssd: Deconvolutional single shot detector
19. Geiger A, Lenz P, Urtasun R (2012) Are we ready for autonomous driving? the kitti vision benchmark
suite. In: IEEE conference on computer vision & pattern recognition
20. Girshick R, Donahue J, Darrell T, Malik J (2013) Rich feature hierarchies for accurate object detection
and semantic segmentation
21. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection
and semantic segmentation, pp 580–587
22. Godard C, Aodha OM, Brostow GJ (2016) Unsupervised monocular depth estimation with left-right
consistency
23. Godard C, Aodha OM, Brostow GJ (2017) Unsupervised monocular depth estimation with left-right
consistency. In: Computer Vision & Pattern Recognition
24. He K, Zhang X, Ren S, Sun J (2015) Spatial pyramid pooling in deep convolutional networks for visual
recognition. IEEE Trans Pattern Anal Mach Intell
25. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition, pp 770–778
26. Jörgensen E, Zach C, Kahl F (2019) Monocular 3d object detection and box fitting trained end-to-end
using intersection-over-union loss
27. Ke NY, Sukthankar R (2004) Pca-sift: a more distinctive representation for local image descriptors. In:
Proceedings of the 2004 IEEE computer society conference on computer vision and pattern recognition
28. Krizhevsky A, Sutskever I, Hinton GE (2017) Imagenet classification with deep convolutional neural
networks. Commun ACM 60(6):84–90
29. Lahoud J, Ghanem B (2017) 2d-driven 3d object detection in rgb-d images. In: IEEE International
conference on computer vision
30. Li P, Chen X, Shen S (2019) Stereo r-cnn based 3d object detection for autonomous driving
31. Li B, Ouyang W, Sheng L, Zeng X, Wang X (2019) Gs3d: An efficient 3d object detection framework for
autonomous driving. In: Proceedings of the IEEE conference on computer vision and pattern recognition,
pp 1019–1028
32. Liang M, Yang B, Zeng W, Chen Y, Hu R, Casas S, Urtasun R (2020) Pnpnet: End-to-end perception and
prediction with tracking in the loop. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pp 11550–11559
33. Lin TY, Goyal P, Girshick R, He K, Dollár P (2017) Focal loss for dense object detection. IEEE Trans
Pattern Anal Mach Intell PP(99):2999–3007
34. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y, Berg AC (2016) Ssd: Single shot multibox
detector. Lect. Notes Comput. Sci, pp 21–37
35. Mayer N, Ilg E, Hausser P, Fischer P, Brox T (2016) A large dataset to train convolutional networks for
disparity, optical flow, and scene flow estimation. In: 2016 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR)
36. Mousavian A, Anguelov D, Flynn J, Kosecka J (2017) 3d bounding box estimation using deep learning
and geometry. In: 2017 IEEE Conference on computer vision and pattern recognition (CVPR)
37. Ng M, Huo Y, Yi H, Wang Z, Shi J, Lu Z, Luo P (2020) Learning depth-guided convolutions for monoc-
ular 3d object detection. In: 2020 IEEE/CVF conference on computer vision and pattern recognition
workshops (CVPRW)
38. Pang S, Morris D, Radha H (2020) Clocs: Camera-lidar object candidates fusion for 3d object detection
39. Qi CR, Liu W, Wu C, Su H, Guibas LJ (2017) Frustum pointnets for 3d object detection from rgb-d data
40. Qi CR, Su H, Mo K, Guibas LJ (2016) Pointnet: Deep learning on point sets for 3d classification and
segmentation
41. Qi CR, Yi L, Su H, Guibas LJ (2017) Pointnet++: Deep hierarchical feature learning on point sets in a
metric space
42. Qian R, Garg D, Wang Y, You Y, Chao WL (2020) End-to-end pseudo-lidar for image-based 3d object
detection. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR)
43. Redmon J, Farhadi A (2017) Yolo9000: better, faster, stronger. In: Proceedings of the IEEE conference
on computer vision and pattern recognition, pp 7263–7271
44. Redmon J, Farhadi A (2018) Yolov3: An incremental improvement
45. Ren S, He K, Girshick R, Sun J (2017) Faster r-cnn: Towards real-time object detection with region
proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149
46. Shafiee MJ, Chywl B, Li F, Wong A (2017) Fast yolo: A fast you only look once system for real-time
embedded object detection in video
47. Shi S, Guo C, Li J, Wang Z, Li H (2020) Pv-rcnn: Point-voxel feature set abstraction for 3d object
detection. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR)
48. Shi W, Rajkumar R (2020) Point-gnn: Graph neural network for 3d object detection in a point cloud. In:
2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR)
49. Shi S, Wang X, Li H (2018) Pointrcnn: 3d object proposal generation and detection from point cloud
50. Silberman N, Hoiem D, Kohli P, Fergus R (2012) Indoor segmentation and support inference from rgbd
images. In: European conference on computer vision
51. Sindagi VA, Zhou Y, Tuzel O (2019) Mvx-net: Multimodal voxelnet for 3d object detection
52. Song S, Lichtenberg SP, Xiao J (2015) Sun rgb-d: A rgb-d scene understanding benchmark suite
53. Su H, Jampani V, Sun D, Maji S, Kalogerakis E, Yang M-H, Kautz J (2018) Splatnet: Sparse lattice
networks for point cloud processing, pp 2530–2539
54. Sun J, Chen L, Xie Y, Zhang S, Jiang Q, Zhou X, Bao H (2020) Disp r-cnn: Stereo 3d object detection
via shape prior guided instance disparity estimation
55. Wang Y, Chao W-L, Garg D, Hariharan B, Campbell M, Weinberger KQ (2018) Pseudo-lidar from
visual depth estimation: Bridging the gap in 3d object detection for autonomous driving
56. Wang Y, Chen X, You Y, Li E, Hariharan B, Campbell M, Weinberger KQ, Chao WL (2020) Train in
Germany, test in the USA: Making 3d object detectors generalize. In: 2020 IEEE/CVF conference on
computer vision and pattern recognition (CVPR)
57. Wang X, Han TX, Yan S (2010) An hog-lbp human detector with partial occlusion handling. In: 2009
IEEE 12th international conference on computer vision
58. Wang Z, Jia K (2019) Frustum convnet: Sliding frustums to aggregate local point-wise features for
amodal 3d object detection
59. Wang X, Shrivastava A, Gupta A (2017) Hard positive generation via adversary for object detection. In:
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
60. Wang J, Zhu M, Sun D, Bo W, Wei H (2019) Mcf3d: Multi-stage complementary fusion for multi-sensor
3d object detection. IEEE Access PP(99):1–1
61. Waymo LLC (2019) Waymo open dataset: An autonomous driving dataset
62. Wu Z, Song S, Khosla A, Yu F, Zhang L, Tang X, Xiao J (2015) 3d shapenets: A deep representation for
volumetric shapes. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR)
63. Xu D, Anguelov D, Jain A (2017) Pointfusion: Deep sensor fusion for 3d bounding box estimation
64. Xu B, Chen Z (2018) Multi-level fusion based 3d object detection from monocular images. In: 2018
IEEE/CVF conference on computer vision and pattern recognition (CVPR)
65. Yang B, Luo W, Urtasun R (2018) Pixor: Real-time 3d object detection from point clouds
66. Zhou J, Lu X, Tan X, Shao Z, Ding S, Ma L (2019) Fvnet: 3d front-view proposal generation for real-time
object detection from point clouds. arXiv:1903.10750
67. Zhou Y, Tuzel O (2017) Voxelnet: End-to-end learning for point cloud based 3d object detection
68. Zhou H, Yuan Y, Shi C (2009) Object tracking using sift features and mean shift. Computer Vision &
Image Understanding 113(3):345–352
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.