AIAV Unit 2 Notes

The document discusses the importance of perception in autonomous driving, highlighting the role of various sensors and datasets in enhancing object detection and segmentation. It covers key detection methods, challenges in tracking, and advancements in convolutional neural networks (CNNs) for object detection and semantic segmentation. Additionally, it addresses stereo vision and optical flow techniques for depth estimation and motion analysis, emphasizing the need for robust algorithms in complex driving environments.


Perception in Autonomous Driving

Introduction

Autonomous vehicles operate in complex environments where accurate perception is crucial.


Sensors such as cameras, LiDAR, radar, and ultrasonic sensors help in gathering
environmental data. Among these, cameras and LiDAR are the most informative. Computer
vision plays a key role in processing visual data for autonomous driving. Since the 1980s,
significant advancements have been made in perception, yet it remains a major challenge.

Datasets

Large datasets are essential for improving perception algorithms in autonomous vehicles.
These datasets help in quantitative evaluation, exposing weaknesses, and enabling fair
comparisons. Common datasets for computer vision tasks include those for image
classification, semantic segmentation, optical flow, stereo vision, and tracking.

For autonomous driving, key datasets include:

• KITTI: A dataset created by Karlsruhe Institute of Technology and Toyota
Technological Institute at Chicago, featuring real-world street scenes collected using
multiple sensors, including GPS, LiDAR, and cameras. It contains data for stereo
vision, optical flow, visual odometry, object detection, object tracking, and road
parsing.
• Cityscapes: A dataset focused on urban scene segmentation.

Newer datasets include:

• Audi Autonomous Driving Dataset (A2D2)
• nuScenes
• Berkeley DeepDrive
• Waymo Open Dataset
• Lyft Level 5 Open Data

These datasets provide high-precision 3D geometry, real-world scenarios, and diverse
perception tasks, making them essential for advancing autonomous driving research.

Detection and Segmentation in Autonomous Driving

Detection

Autonomous vehicles must detect various objects on the road, including cars, pedestrians,
obstacles, and lane markers. Object detection involves three main stages:

1. Preprocessing of input images
2. Region of interest detection
3. Classification of detected objects

Challenges in detection include variations in position, size, shape, orientation, and
appearance. Additionally, detection must be performed in real time for safe navigation.

Key Detection Methods:

• Histogram of Oriented Gradients (HOG) + Support Vector Machine (SVM)
(Dalal & Triggs, 2005): Uses sliding windows to extract HOG features and classify
objects with a linear SVM (a sketch follows this list).
• Deformable Part Model (DPM) (Felzenszwalb et al.): Splits objects into smaller
parts to handle non-rigid shapes and uses latent SVM for detection.
• LiDAR-based Detection: While LiDAR performs well for car detection, it struggles
with pedestrians and cyclists, highlighting the need for sensor fusion.
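
To make the HOG + SVM pipeline concrete, here is a minimal sketch assuming scikit-image
and scikit-learn are available; random arrays stand in for real pedestrian/background
training crops, and a real detector would add multi-scale search and non-maximum
suppression.

import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(patch):
    # 9-bin gradient-orientation histograms over 8x8 cells, block-normalized
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

rng = np.random.default_rng(0)
pos_patches = [rng.random((128, 64)) for _ in range(20)]  # stand-ins for pedestrian crops
neg_patches = [rng.random((128, 64)) for _ in range(20)]  # stand-ins for background crops

X = np.array([hog_features(p) for p in pos_patches + neg_patches])
y = np.array([1] * len(pos_patches) + [0] * len(neg_patches))
clf = LinearSVC(C=0.01).fit(X, y)

def detect(image, window=(128, 64), step=16, thresh=0.5):
    # Slide a fixed-size window over the image and score each crop with the linear SVM.
    boxes = []
    h, w = window
    for y0 in range(0, image.shape[0] - h, step):
        for x0 in range(0, image.shape[1] - w, step):
            score = clf.decision_function([hog_features(image[y0:y0 + h, x0:x0 + w])])[0]
            if score > thresh:
                boxes.append((x0, y0, w, h, score))
    return boxes  # non-maximum suppression would normally follow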

Pedestrian detection is particularly critical for safety, as human behavior is unpredictable
and pedestrians are often occluded. Modern detectors rely on convolutional neural networks
(CNNs), which are
discussed in the next chapter.

Segmentation

Semantic segmentation enhances object detection by assigning a class label to each pixel,
providing a structured understanding of the environment.

Traditional Approach:

• Conditional Random Fields (CRF): A graphical model where nodes (pixels) are
assigned labels based on extracted features, ensuring spatial smoothness and object
coherence.
• Challenges: CRF struggles with long-range dependencies and computational
efficiency.

Advancements:

• Fully connected CRFs with pairwise potentials improve inference speed.
• Algorithms incorporating object class co-occurrence enhance accuracy.
• Deep learning approaches (discussed in the next chapter) improve segmentation
performance using multi-scale features and contextual reasoning.

Stereo, Optical Flow, and Scene Flow in Autonomous Driving

Stereo and Depth Perception

Autonomous vehicles require 3D spatial information for navigation. While LiDAR provides
precise but sparse depth data, stereo cameras offer dense visual information.

• Stereo vision mimics human binocular vision by capturing images from two slightly
different angles and solving a correspondence problem to estimate depth.
• Feature-based methods use distinctive features (e.g., SIFT, SURF) for matching but
provide sparse results.
• Area-based methods use spatial smoothness constraints to compute dense disparity
maps but require more computation.
• Global methods (e.g., Semi-Global Matching (SGM)) optimize disparity estimation
by minimizing energy functions, improving accuracy and efficiency.
• Deep learning-based methods now achieve the best performance in stereo matching
(discussed in the next chapter).

The depth (z) of an object is derived using the formula:

z = \frac{f B}{d}

where B is the camera baseline, d is the disparity, and f is the focal length.
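
As a hedged illustration of the SGM bullet and the formula above, the sketch below uses
OpenCV's semi-global block matcher on a rectified pair and converts disparity to metric
depth via z = fB/d; the file names, focal length, and baseline are placeholder,
KITTI-like values, not figures from these notes.

import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # rectified left image
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)  # rectified right image

# Semi-global matching: approximate energy minimization along several 1D scanline paths
sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5,
                             P1=8 * 5 * 5, P2=32 * 5 * 5, uniquenessRatio=10)
disparity = sgbm.compute(left, right).astype(np.float32) / 16.0  # fixed-point to pixels

f = 721.0  # focal length in pixels (placeholder, roughly KITTI-like)
B = 0.54   # stereo baseline in metres (placeholder, roughly KITTI-like)
depth = np.where(disparity > 0, f * B / disparity, 0.0)  # z = fB/d, zero where invalid

With these placeholder values, a disparity of 30 pixels corresponds to a depth of about
721 × 0.54 / 30 ≈ 13 m.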

Optical Flow

Optical flow estimates 2D motion by tracking intensity changes between consecutive images.
Unlike stereo vision, which captures images simultaneously, optical flow must account for:

• Motion variations due to lighting changes, reflections, and transparency.
• The aperture problem, where motion ambiguity arises due to limited local
observations.

To improve robustness, alternative cost functions have been introduced to replace the
quadratic penalty in classical methods.
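
As one classical example (using OpenCV's built-in estimator rather than the robust cost
functions just mentioned), dense Farneback optical flow returns a per-pixel 2D motion
vector between two consecutive grayscale frames; the file names are placeholders.

import cv2

prev = cv2.imread("frame_t0.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)

# Dense optical flow: one (dx, dy) vector per pixel of the first frame
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])  # speed and direction per pixel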

Scene Flow

Autonomous vehicles need 3D motion estimation rather than just 2D optical flow. Scene
flow extends optical flow by using two consecutive stereo image pairs to estimate both:

• 3D positions of points.
• 3D motion between time intervals.

The KITTI Scene Flow 2015 benchmark evaluates methods for accurate 3D motion
estimation, crucial for vehicle navigation and obstacle avoidance.

Object Tracking in Autonomous Vehicles

Tracking Overview

Tracking estimates an object's location, speed, and acceleration over time, allowing
autonomous vehicles to maintain safe distances and predict movement. This is particularly
challenging for pedestrians and cyclists due to sudden direction changes.

Challenges in tracking:

• Occlusion (partial/full obstruction of objects).
• Appearance similarity among objects of the same class.
• Variability in appearance due to lighting, pose, and articulation.

Bayesian Filtering Approach

Tracking is traditionally modeled as a sequential Bayesian filtering problem with two main
steps:

1. Prediction: The object's state is estimated based on past motion.
2. Correction: The state estimate is refined using new sensor observations.

A commonly used method is the Particle Filter, but its recursive nature makes recovery from
missed detections difficult.
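
To illustrate the predict/correct cycle, here is a minimal constant-velocity Kalman filter,
a simpler relative of the particle filter within the same Bayesian framework; the motion
model, noise levels, and measurements are illustrative only.

import numpy as np

dt = 0.1                                    # time step in seconds
F = np.array([[1.0, dt], [0.0, 1.0]])       # constant-velocity motion model
H = np.array([[1.0, 0.0]])                  # only the position is observed
Q = np.eye(2) * 1e-2                        # process noise (illustrative)
R = np.array([[0.5]])                       # measurement noise (illustrative)

def predict(x, P):
    # 1. Prediction: project the state forward with the motion model
    return F @ x, F @ P @ F.T + Q

def correct(x, P, z):
    # 2. Correction: refine the estimate with the new observation z
    S = H @ P @ H.T + R                     # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    x = x + K @ (np.array([[z]]) - H @ x)
    P = (np.eye(2) - K @ H) @ P
    return x, P

x, P = np.array([[0.0], [0.0]]), np.eye(2)  # state [position, velocity] and its covariance
for z in [1.0, 1.4, 1.9, 2.5]:              # example position measurements over time
    x, P = predict(x, P)
    x, P = correct(x, P, z)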

Alternative Approaches

• Energy minimization: Finds the optimal object trajectory by enforcing motion
smoothness and appearance consistency. However, the large number of possible
object hypotheses makes this approach computationally expensive.
• Tracking-by-detection: Detects objects in consecutive frames and links them, but
faces missed detections and false positives from object detectors.

Markovian Decision Process (MDP) for Tracking

MDP-based tracking defines object states and transitions:

• Active: Object detected.
• Tracked: Object confirmed as valid.
• Lost: Object temporarily undetected but might reappear.
• Inactive: Object lost for too long, removed from tracking.

MDP Tracking Algorithm:

• Uses an SVM classifier to validate detections.
• Applies a tracking-learning-detection model to maintain appearance consistency.
• Continuously updates the object's bounding box template for re-identification.
• Moves objects between states based on learned transition and reward functions.
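
A schematic sketch of the four states with a hand-written stand-in for the transition
logic; the published MDP tracker learns these transitions and rewards from data, and
detection_score and max_lost here are assumed inputs.

from enum import Enum

class TrackState(Enum):
    ACTIVE = 1    # freshly detected, not yet confirmed
    TRACKED = 2   # confirmed and actively tracked
    LOST = 3      # temporarily undetected, may reappear
    INACTIVE = 4  # lost for too long, removed from tracking

def transition(state, detection_score, lost_frames, max_lost=10, conf=0.5):
    # Simplified stand-in for the learned transition and reward functions.
    if state == TrackState.ACTIVE:
        return TrackState.TRACKED if detection_score > conf else TrackState.INACTIVE
    if state == TrackState.TRACKED:
        return TrackState.TRACKED if detection_score > conf else TrackState.LOST
    if state == TrackState.LOST:
        if detection_score > conf:
            return TrackState.TRACKED  # re-identified, e.g. via the bounding box template
        return TrackState.INACTIVE if lost_frames > max_lost else TrackState.LOST
    return TrackState.INACTIVE
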
CNNs and Object Detection in Autonomous Driving

4.1 Convolutional Neural Networks (CNNs)

CNNs are a type of deep neural network that use convolution as the primary computational
operation. They were first introduced by LeCun et al. in 1989, inspired by the visual cortex's
structure. CNNs excel in computer vision tasks due to:

• Local connectivity: Neurons only connect to nearby neurons within a receptive
field.
• Weight sharing: Spatially shared weights reduce the number of parameters, making
CNNs efficient.
• Translation invariance: CNNs learn patterns irrespective of their location in the
image.

CNNs revolutionized computer vision, with models like AlexNet (2012) leading to state-of-
the-art autonomous driving perception systems.
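
A minimal PyTorch sketch of these properties: each 3×3 convolution connects only to a
local receptive field, its kernel weights are shared across all spatial positions, and
global pooling makes the prediction tolerant to translation; the layer sizes are arbitrary.

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # local 3x3 receptive field, shared weights
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                     # global pooling over spatial positions
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

logits = TinyCNN()(torch.randn(1, 3, 64, 64))  # one 64x64 RGB image -> class scores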

4.2 Object Detection in Autonomous Driving

Traditional Object Detection

Early detection methods relied on hand-crafted features and structured classifiers, but these
struggled with large data volumes and object variations.

CNN-Based Object Detection

Girshick et al. introduced R-CNN, showing that CNN-based detectors outperform traditional methods. Faster R-
CNN improved detection by using a Region Proposal Network (RPN) for generating
potential object locations.

Faster R-CNN Pipeline:

1. RPN generates region proposals by scanning feature maps using anchor boxes of
different sizes (e.g., 128×128, 256×256, 512×512).
2. ROI pooling refines proposals, mapping them to a fixed-size feature map.
3. Final classification and bounding box regression predict object type and precise
location.
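
The full pipeline comes pre-assembled in torchvision; a hedged inference sketch, assuming
torchvision ≥ 0.13 and COCO-pretrained weights (not a KITTI-tuned model):

import torch
import torchvision

# RPN + ROI pooling + classification/box-regression heads bundled into one model
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = torch.rand(3, 375, 1242)      # placeholder for an RGB frame scaled to [0, 1]
with torch.no_grad():
    pred = model([image])[0]          # list of images in, list of per-image dicts out
boxes, labels, scores = pred["boxes"], pred["labels"], pred["scores"]
keep = scores > 0.7                   # simple confidence threshold before display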

Proposal-Free Algorithms

Some models avoid the region proposal step for real-time performance:

• SSD (Single Shot MultiBox Detector): Uses multiple convolutional layers to detect
objects of varying sizes in a single pass.
• YOLO (You Only Look Once): Directly predicts object locations and classes in one
forward pass, achieving high speed.

While proposal-free methods are faster, Faster R-CNN still achieves the highest accuracy in
benchmarks like PASCAL VOC. However, it struggles with small, occluded objects in
datasets like KITTI.

Multi-Scale CNN (MS-CNN)

To handle objects of varying sizes, MS-CNN introduces:

• A "trunk" CNN for feature extraction.


• "Branches" with deconvolution layers and ROI pooling to refine small object
detection.
• Improved performance in detecting pedestrians and cyclists compared to Faster R-
CNN.

Anchor-Free Object Detection (FCOS)

Recent methods like FCOS (Fully Convolutional One-Stage Detector) remove predefined
anchor boxes, making detection more flexible.

• Uses a feature pyramid network to extract multi-scale features.
• Employs shared classification and regression heads across feature levels.
• Introduces a center-ness branch to suppress inaccurate detections.
• Achieves state-of-the-art accuracy with lower memory usage.

Conclusion

CNN-based object detection plays a vital role in autonomous driving.

• Faster R-CNN is highly accurate but computationally expensive.
• SSD and YOLO offer real-time detection but trade-off some accuracy.
• MS-CNN and FCOS improve detection for small or occluded objects.

Ongoing advancements in multi-scale detection and anchor-free methods continue to
refine object detection for autonomous driving perception.

Semantic Segmentation in Autonomous Driving

4.3 Semantic Segmentation

Semantic segmentation is crucial in autonomous driving perception, as it helps identify road
surfaces, obstacles, and other scene elements at the pixel level.

Fully Convolutional Networks (FCN)

• FCNs transform traditional CNNs (e.g., VGG-19) by removing the softmax layer
and replacing fully connected layers with 1×1 convolutions.
• They allow input of any size and predict per-pixel labels for segmentation.
• However, small object segmentation is challenging due to dominant larger receptive
fields.
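
A hedged sketch of FCN-style per-pixel prediction using torchvision's bundled FCN (a
ResNet-50 backbone rather than the VGG variant mentioned above), again assuming
torchvision ≥ 0.13; input normalization is omitted for brevity.

import torch
from torchvision.models.segmentation import fcn_resnet50

model = fcn_resnet50(weights="DEFAULT").eval()  # pretrained, 21 Pascal VOC classes

image = torch.rand(1, 3, 512, 1024)             # placeholder RGB batch in [0, 1]
with torch.no_grad():
    logits = model(image)["out"]                # (N, num_classes, H, W) per-pixel scores
labels = logits.argmax(dim=1)                   # one class label for every pixel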

Pyramid Scene Parsing Network (PSPNet)

To address global-local feature integration, Zhao et al. proposed PSPNet, which enhances
FCNs using a pyramid pooling module.

PSPNet Workflow:

1. Feature extraction: A CNN (ResNet) extracts feature maps from the input image.
2. Pyramid pooling: Multi-level pooling (1×1, 2×2, 3×3, 6×6) aggregates contextual
information.
3. Feature compression: Feature maps are passed through 1×1 convolutions for
dimensional reduction.
4. Upsampling & fusion: Pooled features are upsampled and concatenated with original
feature maps for final pixel-wise classification.
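
A minimal sketch of steps 2-4, the pyramid pooling module, with the usual 1/2/3/6 bin
layout; the channel sizes are illustrative rather than the paper's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    # Pool the feature map at several grid sizes, compress, upsample, and concatenate.
    def __init__(self, in_ch=2048, bins=(1, 2, 3, 6)):
        super().__init__()
        out_ch = in_ch // len(bins)
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),      # average pooling (beats max pooling)
                          nn.Conv2d(in_ch, out_ch, 1))  # 1x1 conv for dimension reduction
            for b in bins)

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [F.interpolate(stage(x), size=(h, w), mode="bilinear", align_corners=False)
                  for stage in self.stages]             # upsample each level back to full size
        return torch.cat([x] + pooled, dim=1)           # fuse with the original feature map

features = torch.randn(1, 2048, 60, 60)   # e.g. a ResNet feature map
fused = PyramidPooling()(features)        # (1, 4096, 60, 60) context-enriched features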

Key Findings from PSPNet Experiments:

• Average pooling performs better than max pooling.
• Multi-level pyramid pooling improves segmentation over global-only pooling.
• Feature dimensionality reduction helps maintain efficiency.
• Auxiliary loss aids deep network optimization.

PSPNet won 1st place in the ImageNet Scene Parsing Challenge 2016 and achieved state-
of-the-art performance on PASCAL VOC 2012 and Cityscapes datasets.

Conclusion

Deep learning, particularly FCN-based architectures like PSPNet, has significantly
advanced semantic segmentation in autonomous driving, ensuring more precise road scene
understanding for safer navigation.

Stereo and Optical Flow in Autonomous Driving

4.4 Stereo and Optical Flow

Stereo vision and optical flow are key techniques for depth estimation and motion analysis
in autonomous driving. Both involve matching corresponding points between two images.

4.4.1 Stereo Vision

• Content-CNN (Siamese Architecture):
o Uses two CNN branches (for left and right images) with shared weights.
o Outputs are merged via an inner-product layer to estimate pixel disparity.
o Disparity estimation is treated as a classification problem over possible
disparity values.
o Achieves fast processing on the KITTI Stereo 2012 dataset.
o Post-processing (e.g., local smoothing, semi-global matching) enhances
accuracy and 3D depth estimation.
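
A rough sketch of the Siamese idea, shared-weight branches scored by a per-pixel inner
product over candidate disparities, rather than the exact Content-CNN architecture; the
layer sizes and the maximum disparity are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Branch(nn.Module):
    # Feature extractor applied to both images with the same (shared) weights.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1))

    def forward(self, x):
        return self.net(x)

def disparity_logits(left, right, branch, max_disp=64):
    # Score every candidate disparity with an inner product of left/right features.
    fl, fr = branch(left), branch(right)   # same branch applied twice -> shared weights
    w = fr.shape[3]
    scores = []
    for d in range(max_disp):
        shifted = F.pad(fr, (d, 0))[:, :, :, :w]                 # right features shifted by d
        scores.append((fl * shifted).sum(dim=1, keepdim=True))   # per-pixel inner product
    return torch.cat(scores, dim=1)  # (N, max_disp, H, W): classification over disparities

left, right = torch.rand(1, 1, 128, 256), torch.rand(1, 1, 128, 256)
disparity = disparity_logits(left, right, Branch()).argmax(dim=1)  # winner-take-all disparity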

4.4.2 Optical Flow

• FlowNet (Encoder-Decoder Architecture):
o FlowNetSimple: Stacks images and applies convolution layers, but is
computationally heavy.
o FlowNetCorr: Extracts features separately, merges via a correlation layer,
and applies convolutions.
o Uses “up-convolution” to restore resolution after compression.
o FlowNet achieves competitive performance on KITTI with 0.15-sec GPU
inference time.
o SpyNet refines optical flow estimation using a coarse-to-fine spatial
pyramid approach.
o SpyNet achieves state-of-the-art performance with a lightweight model,
ideal for mobile applications.

4.4.3 Unsupervised Learning for Dense Correspondence

• Challenge: Ground truth dense correspondence is expensive and difficult to collect.
• MonoDepth and MonoDepth2 learn depth without ground truth, using self-supervision
from stereo pairs or monocular video.
o Loss components:
1. Appearance Matching Loss – Assumes corresponding pixels in two
views are visually similar.
2. Disparity Smoothness Loss – Enforces local smoothness with
occasional depth discontinuities.
3. Left–Right Disparity Consistency Loss – Ensures disparity
consistency between left and right views.
o Uses an encoder-decoder structure with skip connections.
o Performs better than traditional methods, with further gains as training data
increases.
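
A hedged sketch of two of the loss terms above, an L1 appearance-matching term and an
edge-aware disparity smoothness term; the SSIM part of the appearance loss, the left-right
consistency term, and the view-synthesis warp itself are omitted for brevity, and the
weighting is illustrative.

import torch

def appearance_loss(target, reconstructed):
    # L1 photometric term: corresponding pixels in the two views should look alike.
    return (target - reconstructed).abs().mean()

def smoothness_loss(disp, image):
    # Penalize disparity gradients except where the image itself has strong edges.
    dx_d = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    dy_d = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()
    dx_i = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

def total_loss(left, left_reconstructed, disp, w_smooth=0.1):
    # left_reconstructed would come from warping the right view with the predicted disparity.
    return appearance_loss(left, left_reconstructed) + w_smooth * smoothness_loss(disp, left)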
