8 Object Detection
Computer Vision
Object Detection
Steve Elston
https://huggingface.co/models?pipeline_tag=object-detection
Overview of Object Detection
As an alternative, semantic segmentation can be used – the algorithms are closely related
• Semantic segmentation is a primary tool for scene understanding
• More on this method in another lesson
Overview of Object Detection
Object detection is a hard problem - Real-world scenes are cluttered
[Figure: classification accuracy with increasing number of categories vs. increasing model complexity; confidence increases along the curve]
• Lower complexity straight-through models are faster
• Choose a model to meet requirements, e.g. single shot detectors
[Figure: a predicted bounding box (bx, by, bw, bh) drawn relative to a prior box with dimensions (pw, ph) at grid cell offset (cx, cy)]
One common, stable parameterization (used in the YOLO family) of the raw outputs (tx, ty, tw, th) is:

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw e^{tw}
bh = ph e^{th}

• The bounding box is now constrained and the parameterization is stable
• p0 is the probability the box contains an object
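This constrained decoding can be sketched in a few lines; the function and argument names below are illustrative, assuming raw network outputs t = (tx, ty, tw, th), a grid-cell offset, and a prior box:

```python
import math

def decode_box(t, cell_xy, prior_wh):
    """Decode raw outputs t = (tx, ty, tw, th) into a box (bx, by, bw, bh).

    cell_xy:  (cx, cy) offset of the grid cell containing the prediction.
    prior_wh: (pw, ph) dimensions of the prior (default) box.
    """
    tx, ty, tw, th = t
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    bx = sigmoid(tx) + cell_xy[0]    # center constrained to lie in the cell
    by = sigmoid(ty) + cell_xy[1]
    bw = prior_wh[0] * math.exp(tw)  # width/height scale the prior box
    bh = prior_wh[1] * math.exp(th)
    return bx, by, bw, bh
```

Because the sigmoid is bounded in (0, 1), the predicted center cannot drift out of its grid cell, which is what makes this parameterization stable.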
Evaluation of bounding box proposals
How can we evaluate bounding boxes computed with object detection?
• Compare the computed bounding box with the marked bounding box
(label)
• Use the ratio of the area of the intersection divided by the area of the
union
• Intersection over union or IoU metric
• Range:
• 0.0 – no overlap
• 1.0 – perfect match
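A minimal sketch of the IoU computation, assuming corner-format boxes (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes give 1.0, disjoint boxes give 0.0, and partial overlap falls in between.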
Evaluation of bounding box proposals
How can we evaluate bounding boxes computed with object
detection?
[Figure: the intersection and the union of two overlapping bounding boxes]
Evaluation of bounding box proposals
How can we evaluate bounding boxes computed with object detection?
• The closer the prediction is to the ground-truth bounding box the higher
the IoU
• We say higher IoU predictions have greater confidence
Loss functions for object detection
How can we construct a multi-task loss function for this problem?
• The overall loss is a weighted sum of a confidence term and a localization term (the SSD formulation):

L(x, c, l, g) = (1/N) [L_conf(x, c) + α L_loc(x, l, g)]

Where:
• α is a trade-off parameter between confidence and location accuracy
• x_ij^p is a binary indicator tensor for matching the i-th default box to the j-th ground truth box of category p
• c is the class of the object
• l are the parameters of the predicted box and g are the parameters of the ground truth box
• N is the number of matched default boxes
Loss functions for object detection
How can we construct a multi-task loss function for this problem?
• The loss component for bounding box localization accuracy uses a smooth L1 distance with respect to the encoded ground truth box location, ĝ, of the i-th box:

L_loc(x, l, g) = Σ_{i ∈ Pos} Σ_{m ∈ {cx, cy, w, h}} x_ij^k smooth_L1(l_i^m - ĝ_j^m)

Where the bounding box prediction, l, has four components for the center, {cx, cy}, and dimensions, {w, h}, encoded with respect to the default bounding box, d:

ĝ_j^cx = (g_j^cx - d_i^cx) / d_i^w
ĝ_j^cy = (g_j^cy - d_i^cy) / d_i^h
ĝ_j^w = log(g_j^w / d_i^w)
ĝ_j^h = log(g_j^h / d_i^h)
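The smooth L1 distance and the encoding of a ground truth box against a default box can be sketched as follows; boxes are in center-size format (cx, cy, w, h) and the names are illustrative:

```python
import math

def smooth_l1(x, beta=1.0):
    """Smooth L1 (Huber-style) distance: quadratic near zero, linear in the tails."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta

def encode_box(g, d):
    """Encode a ground truth box g = (cx, cy, w, h) relative to a default box d."""
    return ((g[0] - d[0]) / d[2],       # center offsets, scaled by default size
            (g[1] - d[1]) / d[3],
            math.log(g[2] / d[2]),      # log-space width/height ratios
            math.log(g[3] / d[3]))
```

The quadratic region of smooth L1 keeps gradients small for near-correct boxes, while the linear tails avoid exploding gradients from outliers.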
Loss functions for object detection
How can we construct a multi-task loss function for this problem?
• The confidence loss component for correct identification at each location is a softmax cross-entropy over positive and negative (background) boxes:

L_conf(x, c) = - Σ_{i ∈ Pos} x_ij^p log(ĉ_i^p) - Σ_{i ∈ Neg} log(ĉ_i^0)

And,

ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p)

Where:
• ĉ_i^p is the softmax confidence for the p-th category in box prediction i
• ĉ_i^0 is the confidence for the background (no object) category in box prediction i
Loss Functions for Object Detection
Class imbalance with object detection
• Class imbalance is a significant problem when training object
detection models
• Example: Foreground objects are generally only a small fraction of pixels
• Example: Many types of small-area background categories – e.g. stripes on a
road
• To overcome class imbalance problems, Li, et al., use two approaches:
• Focal loss is applied in the position head
• Training the end-to-end network uses Dice loss
Loss Function for Object Detection
Dice-Sørensen coefficient, or Dice loss, is considered more robust to
class imbalance
• For two sets, X, Y, the Dice-Sørensen coefficient is defined:

DSC = 2 |X ∩ Y| / (|X| + |Y|)

• As a loss, with y_i the binary label and p_i the binary category prediction:

Dice loss = 1 - 2 Σ_i y_i p_i / (Σ_i y_i + Σ_i p_i)
• Dice loss is equivalent to F1 loss
• Full details on loss functions for training semantic segmentation models
can be found in Jadon, 2020, or Jeremy Jordan’s blog post
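A soft Dice loss over binary labels and predicted probabilities might look like this; the small eps term, an assumption here, guards against division by zero:

```python
def dice_loss(y_true, y_pred, eps=1e-7):
    """Soft Dice loss for binary predictions.

    y_true: iterable of 0/1 labels; y_pred: predicted probabilities in [0, 1].
    """
    inter = sum(t * p for t, p in zip(y_true, y_pred))   # |X ∩ Y| analogue
    total = sum(y_true) + sum(y_pred)                    # |X| + |Y| analogue
    return 1.0 - (2.0 * inter + eps) / (total + eps)
```

Because the loss is a ratio of overlap to total mass, a tiny foreground class is not swamped by the many background pixels, which is why Dice is more robust to class imbalance.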
Loss Function for Object Detection
Focal loss addresses class imbalance by reweighting cross-entropy
• We can write binary cross-entropy in the well known form:

CE(p_t) = - log(p_t)

where p_t = p for the positive class and p_t = 1 - p otherwise
• Focal loss adds a modulating factor:

FL(p_t) = - (1 - p_t)^γ log(p_t)

Where:
• γ ≥ 0 is a focusing hyperparameter
• - log(p_t) is the cross-entropy
• The term (1 - p_t)^γ down-weights easy to learn categories
Find many more details in Lin, et al., 2018
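A direct transcription of the binary focal loss; with gamma = 0 it reduces to ordinary cross-entropy:

```python
import math

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss: down-weights easy, well-classified examples.

    p: predicted probability of the positive class; y: label in {0, 1}.
    """
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)   # (1 - p_t)^gamma modulates CE
```

Confident, correct predictions (p_t near 1) contribute almost nothing, so training gradient is dominated by the hard, rare examples.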
Evaluation of object detection
Need multiple criteria to evaluate object detection
• Is there an object in the box?
• Is the bounding box correct?
• Is the object classification in the box correct?
• Need metrics for accuracy of bounding box and object class predictions
• Average precision is measured on the recall-precision trade-off curve
• Use mean average precision – mAP
• Mean is taken over the average precision of all object classes
Evaluation of object detection
Review of the classification model metrics
• Selectivity or Precision: TP / (TP + FP)
– Fraction of cases classified as positive which are correctly classified
• Sensitivity or Recall: TP / (TP + FN)
– The fraction of positive cases correctly classified
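In code, from the counts of true positives, false positives and false negatives:

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that are found."""
    return tp / (tp + fn)
```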
[Figure: precision-recall curve, with recall increasing along the horizontal axis and the IoU threshold increasing along the curve]
• Average precision is the Area Under the Curve (AUC) of the precision-recall curve
• Approximate the AUC as the sum of the areas of rectangles at each threshold sample
• Usually sample precision at 10 threshold (IoU) values
Evaluation of object detection
Computing mean average precision - mAP
1. Compute the average precision for each of the c classes
2. Compute the mean of the average precisions over all c classes
3. Report mAP as a percentage
• Perfect performance = 100%
• No correct detection and classification = 0%
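The rectangle-sum approximation of AP and the mean over classes can be sketched as follows, assuming precision/recall samples ordered by increasing recall:

```python
def average_precision(precisions, recalls):
    """Approximate AP as a sum of rectangle areas under the precision-recall curve.

    precisions/recalls: matched curve samples, ordered by increasing recall.
    """
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)   # rectangle: height p, width delta-recall
        prev_recall = r
    return ap

def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class APs, reported as a percentage."""
    return 100.0 * sum(ap_per_class) / len(ap_per_class)
```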
Architectures of object detectors
Architectural components of single shot object detectors – example YOLOv4:
• Backbone: a convolutional NN that creates the feature map
• Neck: accommodates objects at multiple scales
• Head: detects objects, identifies them, and applies non-maximal suppression to produce a sparse prediction
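The head's non-maximal suppression step can be sketched greedily: repeatedly keep the highest-scoring box and discard remaining boxes that overlap it above an IoU threshold. Box format and names below are illustrative:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximal suppression; returns indices of the boxes kept."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # drop remaining boxes that overlap the kept box too strongly
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

This is the simple greedy variant; production detectors often use batched or soft NMS, but the idea is the same.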
Backbones: CNNs Create Feature Maps
Many choices have been tried
• VGG-16
• ResNet-50
• EfficientNet-B0/B7
• Darknet-53
• Others…
Neck: Working with multiple scales
Images contain objects at multiple scales
• Need to detect objects across a wide range of scales
• There is a trade-off between semantics and detail
• Large scale has better semantics
• Fine scale has more detail
• Deep neural network architecture produces multiple scales
• Convolution with max pooling reduces detail
• Deeper layers with better semantics
Neck: Working with multiple scales
Convolutional neural network with multi-scale feature map (pyramid)
[Figure: the backbone's convolution/max-pooling layers create the feature map; the neck's convolution/up-sampling layers produce the multi-scale feature map (pyramid); the head predicts bounding boxes and identifications at each scale]
Straight-Through Architectures
Example Single Shot Detector, SSD
[Figure: SSD architecture, with head layers producing an output for each box and class]
Straight-Through Architectures
Architectural components of single shot object detectors – example YOLOv4:
• SD = Spatial down-sampling
• MF = Multi-scale features
• 2x = Doubled channels
Transformer Architecture for Object Detection
Chen, et al., 2022 showed that a simple architecture is superior to more complex hand-engineered architectures
• SD = Spatial down-sampling
• MF = Multi-scale features
• 2x = Doubled channels
Transformer Architecture for Object Detection
Chen, et al., 2022 propose a transformer architecture which may be a path forward for future dense CV tasks