Aiav Unit 2 Notes
Introduction
Datasets
Large datasets are essential for improving perception algorithms in autonomous vehicles.
These datasets help in quantitative evaluation, exposing weaknesses, and enabling fair
comparisons. Common datasets for computer vision tasks include those for image
classification, semantic segmentation, optical flow, stereo vision, and tracking.
Detection
Autonomous vehicles must detect various objects on the road, including cars, pedestrians,
obstacles, and lane markers. Object detection typically involves three main stages: generating
candidate regions, extracting features from each candidate, and classifying each candidate
into an object category.
Pedestrian detection is particularly critical for safety, as human behavior is unpredictable and
pedestrians are often occluded. Modern detectors rely on convolutional neural networks
(CNNs), which are discussed in the next chapter.
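To make the three-stage pipeline concrete, here is a minimal runnable Python sketch: a sliding window proposes regions, a toy feature vector is extracted from each, and a stub classifier accepts or rejects candidates. Every function and threshold here is an illustrative stand-in, not taken from any specific detector; classical systems used e.g. HOG features with a trained SVM.

```python
import numpy as np

# Toy three-stage detection pipeline:
# propose regions -> extract features -> classify.

def propose_regions(image, win=32, stride=16):
    """Stage 1: sliding-window region proposals."""
    h, w = image.shape
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            yield (x, y, win, win)

def extract_features(image, box):
    """Stage 2: a toy feature vector (patch mean and variance)."""
    x, y, w, h = box
    patch = image[y:y + h, x:x + w]
    return np.array([patch.mean(), patch.var()])

def classify(features, threshold=0.55):
    """Stage 3: stub classifier; a real system uses a trained model."""
    return features[0] > threshold  # call bright patches "object"

image = np.random.rand(128, 128)
detections = [box for box in propose_regions(image)
              if classify(extract_features(image, box))]
print(len(detections), "candidate detections")
```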
Segmentation
Semantic segmentation enhances object detection by assigning a class label to each pixel,
providing a structured understanding of the environment.
Traditional Approach:
• Conditional Random Fields (CRFs): a graphical model in which nodes (pixels) are
assigned labels based on extracted features, enforcing spatial smoothness and object
coherence (see the energy sketch after this list).
• Challenges: CRFs struggle to capture long-range dependencies and are
computationally expensive.
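As a sketch of the CRF formulation (notation assumed here, not taken from a specific paper), a labeling x is chosen to minimize an energy combining a per-pixel data term with a pairwise smoothness term:

```latex
E(\mathbf{x}) = \sum_{i} \psi_u(x_i) + \sum_{(i,j) \in \mathcal{N}} \psi_p(x_i, x_j)
```

Here ψ_u(x_i) measures how well label x_i fits the features extracted at pixel i, and ψ_p penalizes neighboring pixels (i, j) in the neighborhood system N that take different labels, which is what enforces spatial smoothness and object coherence.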
Advancements:
• Deep learning approaches, in particular fully convolutional networks (FCNs) and
successors such as PSPNet, have largely superseded CRF-based pipelines; these
models are covered in the CNN sections later in these notes.
Stereo Vision
Autonomous vehicles require 3D spatial information for navigation. While LiDAR provides
precise but sparse depth data, stereo cameras offer dense visual information.
• Stereo vision mimics human binocular vision by capturing images from two slightly
different angles and solving a correspondence problem to estimate depth.
• Feature-based methods use distinctive features (e.g., SIFT, SURF) for matching but
provide sparse results.
• Area-based methods use spatial smoothness constraints to compute dense disparity
maps but require more computation.
• Global methods (e.g., Semi-Global Matching (SGM)) optimize disparity estimation
by minimizing an energy function (sketched after this list), improving accuracy and
efficiency.
• Deep learning-based methods now achieve the best performance in stereo matching
(discussed in the next chapter).
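For reference, the energy minimized by SGM-style methods has roughly the following form (Hirschmüller's formulation, with notation lightly adapted):

```latex
E(D) = \sum_{p} C(p, D_p)
     + \sum_{p} \sum_{q \in \mathcal{N}_p} P_1 \, [\,|D_p - D_q| = 1\,]
     + \sum_{p} \sum_{q \in \mathcal{N}_p} P_2 \, [\,|D_p - D_q| > 1\,]
```

C(p, D_p) is the cost of assigning disparity D_p to pixel p, while the penalties P_1 and P_2 discourage small and large disparity jumps between neighboring pixels, favoring smooth disparity maps with sharp object boundaries.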
Once a pixel's disparity is known, depth follows by triangulation:

Z = fB / d

where B is the camera baseline, d is the disparity, and f is the focal length.
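A minimal area-based sketch in Python, assuming rectified images and illustrative parameter values (window size, disparity range, focal length, baseline): it computes a dense disparity map with sum-of-absolute-differences (SAD) block matching, then converts disparity to depth via Z = fB/d.

```python
import numpy as np

# SAD block matching over a horizontal disparity search (area-based
# method), followed by depth from disparity. Parameters are illustrative.

def disparity_map(left, right, window=5, max_disp=32):
    h, w = left.shape
    half = window // 2
    disp = np.zeros((h, w))
    for y in range(half, h - half):
        for x in range(half + max_disp, w - half):
            patch = left[y - half:y + half + 1, x - half:x + half + 1]
            costs = [np.abs(patch - right[y - half:y + half + 1,
                                          x - d - half:x - d + half + 1]).sum()
                     for d in range(max_disp)]
            disp[y, x] = np.argmin(costs)  # disparity with lowest SAD cost
    return disp

f, B = 700.0, 0.54                       # focal length (px), baseline (m)
left = np.random.rand(64, 96)
right = np.roll(left, -4, axis=1)        # synthetic scene at disparity ~4
d = disparity_map(left, right)
depth = np.where(d > 0, f * B / np.maximum(d, 1e-6), 0.0)  # Z = fB / d
print(depth[32, 64])
```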
Optical Flow
Optical flow estimates 2D motion by tracking intensity changes between consecutive images.
Unlike stereo vision, which captures both images simultaneously, optical flow must account for:
• motion in arbitrary 2D directions, since no epipolar constraint reduces the search to
one dimension;
• illumination changes between frames, which violate the brightness constancy
assumption;
• independently moving objects in addition to camera motion.
To improve robustness, alternative cost functions have been introduced to replace the
quadratic penalty in classical methods.
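As a sketch (variational notation assumed here), the classical energy penalizes data and smoothness violations with a penalty function ρ applied to the brightness-constancy error and to the flow gradients:

```latex
E(u, v) = \iint \rho\big(I(x+u,\, y+v,\, t+1) - I(x, y, t)\big)
        + \lambda \big( \rho(\|\nabla u\|) + \rho(\|\nabla v\|) \big) \, dx \, dy
```

With ρ(s) = s² this is the classical quadratic penalty; choosing a robust alternative such as the Charbonnier penalty ρ(s) = √(s² + ε²) reduces the influence of outliers at motion boundaries and occlusions.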
Scene Flow
Autonomous vehicles need 3D motion estimation rather than just 2D optical flow. Scene
flow extends optical flow by using two consecutive stereo image pairs to estimate both:
• 3D positions of points.
• 3D motion between the two time steps.
The KITTI Scene Flow 2015 benchmark evaluates methods for accurate 3D motion
estimation, crucial for vehicle navigation and obstacle avoidance.
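As a hedged sketch of how both quantities are obtained (assuming rectified stereo with focal length f, baseline B, and principal point (c_u, c_v), none of which appear in the text above): each stereo pair yields a 3D point by triangulation, optical flow links the pixel across the two time steps, and the difference of the reconstructed points is the 3D motion:

```latex
Z = \frac{fB}{d}, \quad X = \frac{(u - c_u)\, Z}{f}, \quad Y = \frac{(v - c_v)\, Z}{f},
\quad \mathbf{m} = \mathbf{P}_{t+1} - \mathbf{P}_t
```

Here (u, v) are the pixel coordinates of a point with disparity d, P_t = (X, Y, Z) is its 3D position at time t, and m is the estimated 3D motion.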
Object Tracking in Autonomous Vehicles
Tracking Overview
Tracking estimates an object's location, speed, and acceleration over time, allowing
autonomous vehicles to maintain safe distances and predict movement. This is particularly
challenging for pedestrians and cyclists due to sudden direction changes.
Challenges in tracking:
• occlusions and missed detections;
• abrupt changes in speed and direction;
• appearance changes across frames.
Tracking is traditionally modeled as a sequential Bayesian filtering problem with two main
steps:
1. Prediction: propagate the object's state forward in time using a motion model.
2. Update: correct the predicted state using the latest detection or measurement.
A commonly used method is the Particle Filter, but its recursive nature makes recovery from
missed detections difficult.
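A minimal bootstrap particle filter sketch in Python for a 1D constant-velocity target; the particle count and all noise levels are illustrative. It shows the prediction, update, and resampling steps described above.

```python
import numpy as np

# Bootstrap particle filter for a 1D target with state [position, velocity].

rng = np.random.default_rng(0)
N = 500
particles = rng.normal(0.0, 1.0, size=(N, 2))   # initial state guesses
weights = np.full(N, 1.0 / N)

def predict(particles, dt=1.0, q=0.1):
    """Prediction: propagate each particle with the motion model."""
    particles[:, 0] += particles[:, 1] * dt
    particles += rng.normal(0.0, q, size=particles.shape)  # process noise
    return particles

def update(particles, weights, z, r=0.5):
    """Update: reweight particles by the measurement likelihood."""
    likelihood = np.exp(-0.5 * ((z - particles[:, 0]) / r) ** 2)
    weights = weights * likelihood + 1e-300     # avoid all-zero weights
    return weights / weights.sum()

def resample(particles, weights):
    """Resample to concentrate particles on high-probability states."""
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

for t in range(10):
    z = 1.0 * t + rng.normal(0.0, 0.5)          # noisy position detection
    particles = predict(particles)
    weights = update(particles, weights, z)
    particles, weights = resample(particles, weights)
    print(t, (weights * particles[:, 0]).sum()) # posterior mean estimate
```

Because each iteration resamples around the current belief, a run of missed or wrong detections can collapse the particle set far from the target, which is exactly the recovery problem noted above.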
Alternative Approaches
Because recursive filters can lose a target after missed detections, recent work has turned to
learning-based methods; the remainder of these notes covers the CNN techniques that
underpin modern detection, segmentation, and tracking.
Convolutional Neural Networks (CNNs)
CNNs are a type of deep neural network that use convolution as the primary computational
operation. They were first introduced by LeCun et al. in 1989, inspired by the structure of
the visual cortex. CNNs excel in computer vision tasks due to:
• local connectivity, so each unit responds to a small receptive field;
• weight sharing, which yields translation equivariance and far fewer parameters;
• hierarchical feature learning, from edges up to object parts.
CNNs revolutionized computer vision, with models like AlexNet (2012) leading to state-of-
the-art autonomous driving perception systems.
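To ground the term "convolution", here is a minimal Python sketch of the sliding-window operation a convolutional layer applies at every spatial position (deep learning frameworks actually implement cross-correlation, shown here, rather than flipped-kernel convolution):

```python
import numpy as np

# Valid 2D convolution (cross-correlation) of an image with a kernel.

def conv2d(image, kernel):
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # each output value is a weighted sum over a local patch
            out[y, x] = (image[y:y + kh, x:x + kw] * kernel).sum()
    return out

edge_kernel = np.array([[1, 0, -1]] * 3, dtype=float)  # vertical-edge filter
image = np.random.rand(8, 8)
print(conv2d(image, edge_kernel).shape)  # (6, 6): valid convolution
```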
Early detection methods relied on hand-crafted features and structured classifiers, but these
struggled with large data volumes and object variations.
Girshick et al. introduced R-CNN, demonstrating that CNNs substantially outperform
traditional methods. Faster R-CNN improved detection by using a Region Proposal Network
(RPN) to generate potential object locations. Detection proceeds in three stages:
1. RPN generates region proposals by scanning feature maps using anchor boxes of
different sizes (e.g., 128×128, 256×256, 512×512).
2. ROI pooling refines proposals, mapping them to a fixed-size feature map.
3. Final classification and bounding box regression predict object type and precise
location.
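A small Python sketch of step 1, anchor generation: at each feature-map cell, boxes of several scales are placed around the cell center. The scales come from the text; the stride of 16 and the three aspect ratios are standard choices used here as assumptions.

```python
import numpy as np

# RPN-style anchor generation over a feature map.

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # cell center
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)  # area stays ~s^2
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

print(generate_anchors(2, 3).shape)  # (2*3 cells * 9 anchors, 4) = (54, 4)
```

The RPN then scores each anchor as object/background and regresses offsets; surviving anchors become the region proposals that ROI pooling consumes in step 2.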
Proposal-Free Algorithms
Some models avoid the region proposal step for real-time performance:
• SSD (Single Shot MultiBox Detector): Uses multiple convolutional layers to detect
objects of varying sizes in a single pass.
• YOLO (You Only Look Once): Directly predicts object locations and classes in one
forward pass, achieving high speed.
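A hedged sketch of the single-pass idea: the network emits one tensor in which every grid cell directly predicts a box, an objectness score, and class scores. The tensor layout, grid size, and threshold below are simplified assumptions, not the exact YOLO format.

```python
import numpy as np

# Decode a toy single-shot output tensor: S x S grid, each cell predicts
# (cx, cy, w, h, objectness) plus C class scores.

S, C = 7, 3                                   # grid size, #classes (toy)
pred = np.random.rand(S, S, 5 + C)            # stand-in for network output

boxes = []
for gy in range(S):
    for gx in range(S):
        cx, cy, w, h, obj = pred[gy, gx, :5]
        if obj > 0.9:                         # keep confident cells only
            cls = int(pred[gy, gx, 5:].argmax())
            # offsets are relative to the cell; convert to image fractions
            boxes.append(((gx + cx) / S, (gy + cy) / S, w, h, cls))
print(len(boxes), "boxes kept from one forward pass")
```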
While proposal-free methods are faster, Faster R-CNN still achieves the highest accuracy in
benchmarks like PASCAL VOC. However, it struggles with small, occluded objects in
datasets like KITTI.
Recent methods such as FCOS (Fully Convolutional One-Stage Object Detection) remove
predefined anchor boxes, making detection more flexible.
CNN-Based Semantic Segmentation
• FCNs transform traditional CNNs (e.g., VGG-19) by removing the softmax layer
and replacing fully connected layers with 1×1 convolutions.
• They allow input of any size and predict per-pixel labels for segmentation.
• However, small objects are hard to segment because features with large receptive
fields dominate and fine detail is lost.
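A minimal PyTorch sketch of the idea, using a toy backbone rather than VGG-19's exact configuration: a 1×1 convolution replaces the fully connected head, so any input size yields a per-pixel score map, which is upsampled back to the input resolution.

```python
import torch
import torch.nn as nn

# Fully convolutional segmentation head: 1x1 conv instead of fc layers.

backbone = nn.Sequential(                 # stand-in convolutional backbone
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
)
num_classes = 21
head = nn.Conv2d(128, num_classes, kernel_size=1)  # replaces fc layers

x = torch.randn(1, 3, 96, 128)            # any spatial size works
scores = head(backbone(x))                # per-location class scores
out = nn.functional.interpolate(          # upsample to input resolution
    scores, size=x.shape[2:], mode="bilinear", align_corners=False)
print(out.shape)                          # torch.Size([1, 21, 96, 128])
```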
To address global-local feature integration, Zhao et al. proposed PSPNet, which enhances
FCNs using a pyramid pooling module.
PSPNet Workflow:
1. Feature extraction: A CNN (ResNet) extracts feature maps from the input image.
2. Pyramid pooling: Multi-level pooling (1×1, 2×2, 3×3, 6×6) aggregates contextual
information.
3. Feature compression: Feature maps are passed through 1×1 convolutions for
dimensional reduction.
4. Upsampling & fusion: Pooled features are upsampled and concatenated with original
feature maps for final pixel-wise classification.
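A compact PyTorch sketch of a pyramid pooling module with the listed bin sizes; the channel counts are illustrative rather than PSPNet's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Pyramid pooling: pool features to 1x1, 2x2, 3x3, 6x6, compress each
# level with a 1x1 conv, upsample, and concatenate with the input.

class PyramidPooling(nn.Module):
    def __init__(self, in_ch=64, bins=(1, 2, 3, 6)):
        super().__init__()
        out_ch = in_ch // len(bins)              # per-branch compression
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),      # multi-level pooling
                          nn.Conv2d(in_ch, out_ch, 1))  # 1x1 compression
            for b in bins)

    def forward(self, x):
        size = x.shape[2:]
        pooled = [F.interpolate(branch(x), size=size,   # upsample & fuse
                                mode="bilinear", align_corners=False)
                  for branch in self.branches]
        return torch.cat([x] + pooled, dim=1)

feats = torch.randn(1, 64, 32, 32)          # e.g. ResNet feature maps
print(PyramidPooling()(feats).shape)        # torch.Size([1, 128, 32, 32])
```

Concatenating the original features with all pooled levels is what combines local detail with global scene context in a single map.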
PSPNet won 1st place in the ImageNet Scene Parsing Challenge 2016 and achieved state-
of-the-art performance on PASCAL VOC 2012 and Cityscapes datasets.
Conclusion
Stereo vision and optical flow are key techniques for depth estimation and motion analysis
in autonomous driving. Both involve matching corresponding points between two images.