
CSCI E-25

Computer Vision
Object Detection
Steve Elston

Copyright 2020, 2021, 2022, 2023, Stephen F. Elston. All rights reserved.


Overview of Object Detection
Goal of object detection is to localize and identify objects in an image
• Real-world scenes are complex with multiple objects
• Localizing and identifying multiple objects is key to scene understanding
• Task related to semantic segmentation
• Localization parameterized by bounding box
• Is a dense CV task
• Object detection and segmentation are dense CV tasks
• Classification is not dense
• Objects can occur anywhere in an image
• Localize objects to pixel-level
• Classification does not require pixel-level accuracy
Overview of Object Detection
Goal of object detection is to find, localize and identify objects in an image
• Is a multi-task AI problem
• Finding numeric values of bounding box location and dimensions is a
regression problem
• Identification of objects in bounding boxes is a classification problem
• Training models requires a multi-task loss function!

Try it yourself! Object detection is widely used commercially


https://cloud.google.com/vision/automl/object-detection/docs/
https://docs.microsoft.com/en-us/azure/cognitive-services/custom-vision-service/get-started-build-detector

https://huggingface.co/models?pipeline_tag=object-detection
Overview of Object Detection
As an alternative, can use semantic segmentation – closely related algorithms
• Semantic segmentation is a primary tool for scene understanding
• More on this method in another lesson
Overview of Object Detection
Object detection is a hard problem - Real-world scenes are cluttered

• Hard to uniquely detect and classify all objects
• This object is both a vase and a potted plant
Lesson Overview
• Elements of object detection algorithms
• Parameterization of bounding boxes
• Evaluation of object detection algorithms
• Multiple prior bounding boxes
• Solving the object detection problem
• Working with multiple scales
• Single-shot algorithms for video
• Transformer architectures
Overview of Object Detection
What approaches might be used to find objects in a complex scene?
• Example: Classical approach
• Compute features – e.g. HOGs
• Search to localize objects
• Classify objects detected
• Example: Classical approach
• Compute embedding – e.g. PCA
• Locate similar patches
• e.g. the eigen-faces algorithm
• Use deep neural networks
• Dramatic increase in accuracy
• speed may be sacrificed
• Create feature map with CNN
• Localize and classify objects
Elements of Object Detection Algorithms

Object detection algorithms have some common elements


• Convolutional Neural Network: CNN backbone creates a feature map
which is used to detect and classify objects
• Candidate bounding boxes: Multiple candidate bounding boxes are
generated for each region
• Bounding box maximal suppression: The probability of an object being
in each bounding box (objectness) is computed, and low-probability boxes are
suppressed
• Minimal bounding boxes: The size of the bounding boxes is adjusted to
best fit the objects
• Identification: Classify the object in each bounding box
Overview of Object Detection
Object detection is a hard problem - Occlusion is common in real-world scenes

• How many people are in this scene?
• Are these heads or people?
• What is this object?
Overview of Object Detection
Object detection is a hard problem – Trade-offs of speed, categories and accuracy

• Selection of object detection models has multi-dimensional trade-offs
• Less confidence in classification with increasing number of categories
• Lower complexity straight-through models are faster
• Choose model to meet requirements

[Figure: trade-off space – classification confidence falls as the number of categories increases; complex multi-step models sit at high model complexity, single shot detectors at high frame rate (speed)]

Evolution of Object Detection Algorithms
An incomplete list of seminal object detection algorithms
• Erhan et. al., 2013, Scalable Object Detection using Deep Neural
Networks, introduced the R-CNN algorithm, the first widely accepted
deep learning object detection algorithm. R-CNN demonstrated a
significant improvement in object recognition accuracy over classical
methods. However, this algorithm is too slow for real-time video
processing.
• Girshick, 2015, Fast R-CNN simplified the required computations, but was still
too slow for real-time video.
Evolution of Object Detection Algorithms
An incomplete list of seminal object detection algorithms
• Ren et. al., 2016, proposed the Faster R-CNN algorithm, but the computational
complexity of the algorithm was still rather high.
• He, et. al., in 2018, introduced the Mask R-CNN algorithm, which exhibits significantly
improved object detection accuracy, particularly when there are large numbers of
objects, such as a flock of birds or a crowd of people. While not efficient
enough for real-time video, it is accurate for complex scenes.
Evolution of Object Detection Algorithms
An incomplete list of seminal object detection algorithms
• Single shot real-time object detection algorithms
• Liu et. al., 2016, Single Shot MultiBox Detector (SSD) performs bounding box
fitting, object detection, and classification in one step. This single shot
algorithm provides real-time performance for video.
• Redmon, et. al., 2016, You Only Look Once: Unified, Real-Time Object
Detection (YOLO) is an alternative single shot detector. YOLO version 1
suffered from low accuracy.
• Redmon, et. al., 2016, YOLO 9000: Better, Faster, Stronger (aka YOLO v2)
made several improvements over the original algorithm, including a more
efficient CNN and a larger, integrated training data set.
Evolution of Object Detection Algorithms
An incomplete list of seminal object detection algorithms
• Single shot real-time object detection algorithms
• Redmon, et. al., 2018, YOLOv3: An Incremental Improvement, primarily a
new CNN backbone.
• Lin, et. al., 2018, Focal Loss for Dense Object Detection
• Bochkovskiy, et. al., 2020, YOLOv4: Optimal Speed and Accuracy of
Object Detection
• YOLOv5, YOLOv6, not published – so who knows??
• Chen, et. al., 2022 proposed a simplified transformer architecture for
dense CV tasks. This work may point toward the future
Parameterization of Bounding Boxes
Need a stable parameterization of the 4 parameters of a bounding box
• Start with a prior, prototype, or anchor for the bounding box
• cx, cy is the center of the prior
• pw is the width of the prior
• ph is the height of the prior
• Then compute the best-fit box
• bx, by is the center of the bounding box
• bw is the width of the bounding box
• bh is the height of the bounding box
Parameterization of Bounding Boxes
Need a stable parameterization of the 4 parameters of a bounding box
• A naive approach is to solve a linear system of equations for the
parameters tx, ty, tw, th
• But the parameters of the bounding box are unconstrained!
• The solution can be unstable
Parameterization of Bounding Boxes
Need a stable parameterization of the 4 parameters of a bounding box
• A better parameterization is proposed (see the sketch below)
• The bounding box is now constrained and the parameterization is stable
• p0 is the probability the box contains an object
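The parameterization equations on this slide were images and did not survive extraction. As a sketch of the constrained form used by YOLOv2/YOLOv3 (Redmon et. al.), with σ the logistic sigmoid and tx, ty, tw, th, to the raw network outputs:

    \begin{aligned}
    b_x &= \sigma(t_x) + c_x \\
    b_y &= \sigma(t_y) + c_y \\
    b_w &= p_w \, e^{t_w} \\
    b_h &= p_h \, e^{t_h} \\
    p_0 &= \sigma(t_o)
    \end{aligned}

The sigmoid keeps the center offsets inside the grid cell and the exponential keeps the width and height positive, which is what makes the fit stable.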
Evaluation of bounding box proposals
How can we evaluate bounding boxes computed with object detection?
• Compare the computed bounding box with the marked bounding box
(label)
• Use the ratio of the area of the intersection divided by the area of the
union
• Intersection over union or IoU metric
• Range:
• 0.0 – no overlap
• 1.0 – perfect match
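A minimal Python sketch of the IoU computation for axis-aligned boxes; the (x1, y1, x2, y2) corner format is an assumption for illustration, not something fixed by the slides:

    def iou(box_a, box_b):
        """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
        # Corners of the intersection rectangle
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2])
        y2 = min(box_a[3], box_b[3])
        # Overlap area is zero when the boxes are disjoint
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    # Example: overlap area 25, union area 175
    print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # about 0.143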
Evaluation of bounding box proposals
How can we evaluate bounding boxes computed with object
detection?

Intersection Union
Evaluation of bounding box proposals
How can we evaluate bounding boxes computed with object detection?
• The closer the prediction is to the ground-truth bounding box the higher
the IoU
• We say higher IoU predictions have greater confidence

Figure from Balasubramaniam and Pasricha, 2022


Multiple Prior Bounding Boxes
• Images can contain
many objects
• SSD uses a grid to divide
the image
• Can fit bounding boxes
around centroids in
each of the grid cells
• Use odd grid
dimensions so there is a
centroid at the center
of image
Multiple Prior Bounding Boxes
• Images contain many
objects
• Impose grid over image
• Locate objects on the
grid
Multiple Prior Bounding Boxes
There are many possible bounding box proposals
• Start with a first bounding box proposal, with
centroid
• Boxes with different aspect ratios and same
centroid
• Apply non-maximal suppression to box
estimates using proposals as prior
Multiple Prior Bounding Boxes
• Multiple objects in
image
• Bounding box
prototypes center on
grid elements
• Multiple prior bounding
box candidates
• SSD uses multiple grids
to accommodate
multiple scales
Finding Priors for Bounding Boxes
Good priors are required for solution
• Priors for VOC and COCO for
YOLO models
• For both data sets, tall and
narrow priors are favored
Performance comparison and trade-offs
YOLOv2 (9000) uses overlapping bounding boxes at multiple scales
• Starts with a grid of anchor
boxes on a single grid
• Priors for multiple scale
bounding boxes around
anchor boxes
• Maximal suppression of
bounding boxes using
probability map
• Result is bounding boxes at
multiple scales
• From Redmon, et. al. 2016
Solving Object Detection Problem
Solve object detection as a regression problem
Find the bounding box (regression), objectness (or probability of no object), and
category (c1, c2, …, cn) as the label for the box
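As an illustration only (this layout is my own, not taken from the slides), a YOLO-style regression target for a single box can be written as one vector:

    y = [ p0, bx, by, bw, bh, c1, c2, …, cn ]

with p0 the objectness, (bx, by, bw, bh) the box parameters, and c1…cn a one-hot class encoding.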
Solving Object Detection Problem
Solve object detection as a multi-task problem
• Can formulate the problem with labels for
multiple bounding boxes
• Solve box regression problem and identification
problem in one step
• Use fully convolutional neural networks
Solving Object Detection Problem
Solve object detection as a regression problem
Find the most probable bounding box with the non-max suppression algorithm
(a Python sketch follows this pseudocode):

Filter all boxes with p0 below a threshold, say 0.5

While( more than one overlapping box ):
    Select the remaining box with the highest probability
    Compute the IoU with the overlapping bounding boxes
    Compute the probability of objects in the boxes, P(ci)
    Suppress (filter out) bounding boxes with f(IoU, P(ci))
    below the threshold
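A minimal Python sketch of greedy non-max suppression, assuming per-box objectness scores and the iou() helper sketched earlier; the 0.5 thresholds are illustrative defaults, not values fixed by the slides:

    def non_max_suppression(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
        """Greedy NMS: keep the most confident box, drop boxes that overlap it strongly."""
        # Discard boxes whose objectness is below the score threshold
        candidates = [(s, b) for s, b in zip(scores, boxes) if s >= score_thresh]
        # Visit boxes from most to least confident
        candidates.sort(key=lambda sb: sb[0], reverse=True)
        kept = []
        while candidates:
            best_score, best_box = candidates.pop(0)
            kept.append((best_score, best_box))
            # Suppress remaining boxes that overlap the selected box too much
            candidates = [(s, b) for s, b in candidates
                          if iou(best_box, b) < iou_thresh]
        return kept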
Solving Object Detection Problem
Find most probable bounding box with non-max suppression algorithm:
• Most bounding boxes will not optimally contain an object
• Imbalance with many more true negatives than true positives
• Can lead to poor model training
• SSD solves this problem with a hard negative mining algorithm:
1. Sort bounding boxes by confidence score
2. Retain only most confident cases to maintain a 3:1 ratio of negative to
positive cases.
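A sketch of the 3:1 hard negative mining rule described above, assuming each candidate box carries a confidence score and a positive/negative match flag (the function and argument names are illustrative):

    def hard_negative_mining(confidences, is_positive, neg_pos_ratio=3):
        """Keep all positive boxes and only the most confident negatives, at a 3:1 ratio."""
        positives = [i for i, pos in enumerate(is_positive) if pos]
        negatives = [i for i, pos in enumerate(is_positive) if not pos]
        # Sort negative boxes so the most confident (hardest) negatives come first
        negatives.sort(key=lambda i: confidences[i], reverse=True)
        # Retain at most neg_pos_ratio negatives per positive case
        keep_neg = negatives[: neg_pos_ratio * max(len(positives), 1)]
        return positives + keep_neg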
Loss functions for object detection
How can we construct a multi-task loss function for this problem?
• Execute tasks sequentially
• First step identify bounding box
• Second step classify objects
• Each step trained individually with specific loss function
• Examples include R-CNN algorithm
• Multiple steps are inherent performance bottleneck
• Combine tasks
• Use a multi-task loss function for training
• Used in efficient single shot algorithms; e.g. SSD and YOLO
• Suitable for video rates
Loss functions for object detection
How can we construct a multi-task loss function for this problem?
• Need loss component for bounding box localization accuracy
• Need loss component for identification confidence accuracy
• Example: For N matched bounding boxes, SSD uses this loss function:

Where:
α is a trade-off parameter between confidence and location accuracy
x is a binary indicator tensor for matching the i-th prototype box to the j-th
ground truth box
c is the class of the object
l are the parameters of the predicted box and g are the parameters of the ground truth box
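The loss equation itself was an image on this slide; for reference, the SSD paper (Liu et. al., 2016) writes the multi-task loss over the N matched boxes as

    L(x, c, l, g) = \frac{1}{N} \left( L_{conf}(x, c) + \alpha \, L_{loc}(x, l, g) \right)

with the confidence and localization components defined on the following slides.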
Loss functions for object detection
How can we construct a multi-task loss function for this problem?
• Loss component for bounding box localization accuracy uses a smooth L1
distance with respect to the encoded ground truth box location, ĝ, of the i-th box

Where the bounding box prediction, l, has four components for the center, {cx, cy},
and dimensions, {w, h}, with respect to the default bounding box, d:
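The localization loss formulas were images on this slide; as they appear in the SSD paper (Liu et. al., 2016), they are

    L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}}
        x_{ij}^{k} \, \mathrm{smooth}_{L1}\!\left( l_i^{m} - \hat{g}_j^{m} \right)

    \hat{g}_j^{cx} = \frac{g_j^{cx} - d_i^{cx}}{d_i^{w}}, \quad
    \hat{g}_j^{cy} = \frac{g_j^{cy} - d_i^{cy}}{d_i^{h}}, \quad
    \hat{g}_j^{w} = \log\frac{g_j^{w}}{d_i^{w}}, \quad
    \hat{g}_j^{h} = \log\frac{g_j^{h}}{d_i^{h}}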
Loss functions for object detection
How can we construct a multi-task loss function for this problem?
• Confidence loss component for correct identification at each location:

Where, the class probability prediction is given by the softmax function:

And,
c_i^p is the p-th category in box prediction i
c_i^0 is the category for no object in box prediction i
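The confidence loss and softmax formulas were images on this slide; as given in the SSD paper (Liu et. al., 2016), they are

    L_{conf}(x, c) = - \sum_{i \in Pos}^{N} x_{ij}^{p} \log\left( \hat{c}_i^{p} \right)
                     - \sum_{i \in Neg} \log\left( \hat{c}_i^{0} \right),
    \qquad
    \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}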
Loss Functions for Object Detection
Class imbalance with object detection
• Class imbalance is a significant problem when training object
detection models
• Example: Foreground objects are generally only a small fraction of pixels
• Example: Many types of small-area background categories – e.g. stripes on a
road
• To overcome class imbalance problems, Li, et. al., use two
approaches:
• Focal loss is applied in the position head
• Training the end-to-end network uses Dice loss
Loss Function for Object Detection
Dice-Sørensen coefficient, or Dice loss, is considered more robust to
class imbalance
• For two sets, A and B, the Dice-Sørensen coefficient is defined:

• For the simple case of binary classification, we write the Dice loss:

Where:
y is the label
p̂ is the binary category prediction
• Dice loss is equivalent to F1 loss
• Full details on loss functions for training semantic segmentation models
can be found in Jadon, 2020, or Jeremy Jordan’s blog post
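The Dice formulas on this slide were images; as a sketch of the standard definitions (the +1 smoothing term in the binary form follows Jadon, 2020, and is an assumption here):

    \mathrm{DSC}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}
    \qquad
    DL(y, \hat{p}) = 1 - \frac{2\, y \hat{p} + 1}{y + \hat{p} + 1}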
Loss Function for Object Detection
Focal loss addresses class imbalance by reweighting cross-entropy
• We can write binary cross-entropy
in the well known form:

• Focal loss reweights cross-entropy:

• Where:
γ is a hyperparameter
CE(pt) is the cross-entropy
• The (1 - pt)^γ term down-weights easy to
learn categories
Find many more details in Lin, et. al, 2018
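The cross-entropy and focal loss equations on this slide were images; as written in Lin, et. al., 2018, they are

    CE(p_t) = -\log(p_t), \qquad
    p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases}

    FL(p_t) = -(1 - p_t)^{\gamma} \log(p_t)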
Evaluation of object detection
Need multiple criteria to evaluate object detection
• Is there an object in the box?
• Is the bounding box correct?
• Is the object classification in the box correct?
• Need metrics for accuracy of bounding box and object class predictions
• Average precision measured on recall-precision trade-off curve
• Use mean average precision – mAP
• Mean taken over average precision over all object classes
Evaluation of object detection
Review of the classification model metrics
• Selectivity or Precision:
– Fraction of cases classified as positive which are correctly classified
– Precision = TP / (TP + FP)
• Sensitivity or Recall:
– The fraction of positive cases correctly classified
– Recall = TP / (TP + FN)
• There is an inherent trade-off between precision and recall


• Can average precision on recall-precision curve
– Change threshold or confidence to quantify curve
Evaluation of object detection
Computing Average Precision - AP
• Precision decreases as recall increases
• Recall increases as confidence (IoU) increases
• Average precision is the Area Under the Curve (AUC) of the precision-recall curve
• Approximate the AUC as a sum of areas of rectangles at each threshold sample
• Usually sample precision at 10 threshold (IoU) values

[Figure: precision-recall curve with precision on the vertical axis and recall increasing along the horizontal axis as the threshold (IoU) increases]
Evaluation of object detection
Computing mean average precision - mAP
1. Compute the average precision for each of the c classes
2. Compute the mean of the average precisions over all c classes
3. Report mAP as a percentage
• Perfect performance = 100%
• No correct detection and classification = 0%
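A minimal Python sketch of the rectangle-sum approximation described above, assuming precision has already been sampled at a set of recall points for each class (function and variable names are illustrative):

    def average_precision(recalls, precisions):
        """Approximate the AUC of the precision-recall curve with rectangles."""
        points = sorted(zip(recalls, precisions))  # order samples by increasing recall
        ap, prev_recall = 0.0, 0.0
        for recall, precision in points:
            # Rectangle: width is the recall increment, height is the precision there
            ap += (recall - prev_recall) * precision
            prev_recall = recall
        return ap

    def mean_average_precision(per_class_samples):
        """per_class_samples: {class_name: (recalls, precisions)}; returns mAP in percent."""
        aps = [average_precision(r, p) for r, p in per_class_samples.values()]
        return 100.0 * sum(aps) / len(aps)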
Architectures of object detectors
Architectural components of single shot object detectors – Example: YOLOv4

• Backbone: convolutional NN creates the feature map
• Neck: accommodates multiple scales
• Head: detects objects and identifies them
• Sparse Prediction: applies non-maximal suppression
Backbones: CNNs Create Feature Maps
Many choices have been tried
• VGG-16
• ResNet-52
• EfficientNet-B0/B7
• Darknet-53
• Others…
Neck: Working with multiple scales
Images contain objects at multiple scales
• Need to detect objects across wide range of scale
• Is trade-off between semantics and detail
• Large scale has better semantics
• Fine scale has more detail
• Deep neural network architecture produces multiple scales
• Convolution with max pooling reduces detail
• Deeper layers with better semantics
Neck: Working with multiple scales
Convolutional neural network with multi-scale feature map (pyramid)

[Figure: the image feeds the backbone convolution/max-pooling layers, which create the feature map; the convolution/up-sampling layers of the neck produce features at multiple scales; a head for detection and identification predicts bounding boxes at each scale]
Straight-Through Architectures
Example Single Shot Detector, SSD

• SSD takes a different approach to the speed-accuracy trade-off


• SSD achieves efficiency by scoring multiple bounding boxes at different
scales simultaneously using convolutional layers
Straight-Through Architectures
Example Single Shot Detector, SSD

• SSD is a fully convolutional network


• No fully connected layers
Straight-Through Architectures
Example Single Shot Detector, SSD

VGG-16 Backbone Convolutional Layers


Creates feature map
Straight-Through Architectures
Example Single Shot Detector, SSD

Convolutional Layers of neck down sample to multiple scales


Detection and classification on grids
Straight-Through Architectures
Example Single Shot Detector, SSD

Head layers
Output for each box and class
Straight-Through Architectures
Architectural components of single shot object detectors – Example: YOLOv4

• Backbone: DarkNet-53 creates the feature map
• Neck: accommodates multiple scales
• Head: detects objects and identifies them
• Sparse Prediction: applies non-maximal suppression

Figure from Bochkovskiy, et. al., 2020
YOLOv4
YOLOv4 incorporates multiple improvements for better performance

• YOLOv4 introduced mosaic


data augmentation
• Mosaic created from patches
of several images
• Gives greater diversity of
backgrounds and objects in
augmented images

Figure from Bochkovskiy, et. al., 2020
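A rough Python sketch of the mosaic idea, assuming the four source images are NumPy arrays already resized to the output size; the quadrant layout and random split point are my own simplification, and remapping of the box labels is omitted:

    import random
    import numpy as np

    def mosaic(images, out_size=608):
        """Combine patches of four images into one mosaic training image."""
        assert len(images) == 4
        canvas = np.zeros((out_size, out_size, 3), dtype=images[0].dtype)
        # Random split point dividing the canvas into four quadrants
        cx = random.randint(out_size // 4, 3 * out_size // 4)
        cy = random.randint(out_size // 4, 3 * out_size // 4)
        regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
                   (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
        for img, (x1, y1, x2, y2) in zip(images, regions):
            h, w = y2 - y1, x2 - x1
            # Take a patch of the required size from each image
            canvas[y1:y2, x1:x2] = img[:h, :w]
        return canvas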


YOLOv4
YOLOv4 incorporates multiple improvements for better performance

• S = Sensitivity for bounding box


• M = Mosaic augmentation
• IT = Multiple anchors for single
ground truth
• GA = Genetic algorithm for
network model selection
• OA = Optimized anchors for 512
x 512 image

Figure from Bochkovskiy, et. al., 2020


YOLOv4
YOLOv4 incorporates multiple improvements for better performance

Figure from Bochkovskiy, et. al., 2020


Transformer Architecture for Object Detection
Chen, et. al., 2022 proposed a simplified transformer architecture for dense CV tasks

• A pure ViT transformer architecture creates the feature map, followed by task-specific heads
• Constant window size (UViT), or window size increasing with depth (+), to achieve high density efficiently
Transformer Architecture for Object Detection
Chen, et. al., 2022 showed a simple architecture is superior to more complex
hand-engineered architectures

• SD = Spatial down-sampling
• MF = Multi-scale features
• 2x = Doubled channels
Transformer Architecture for Object Detection
Chen, et. al., 2022 propose a transformer architecture which may be a path
for future dense CV tasks

How well does the pure transformer model work?


Summary
• Elements of object detection algorithms
• Parameterization of bounding boxes, maintain stability
• Multiple prior (anchor) bounding boxes
• Evaluation of object detection algorithms, mAP
• Solving the object detection problem, multi-task loss function
• Working with multiple scales
• Single-shot algorithms for video
• Transformer architecture
