CHAPTER 2
LITERATURE SURVEY
The evolution of object detection in images and videos has been shaped
by groundbreaking research in machine learning and deep learning. Ren et al.'s
Faster R-CNN introduced Region Proposal Networks (RPNs), streamlining
detection and enabling real-time performance. Redmon et al.'s YOLO reframed
detection as a regression problem, reducing computational complexity and
enhancing speed, ideal for applications like video surveillance. Liu et al.'s SSD
introduced single-stage detection, improving efficiency and enabling
deployment on resource-constrained devices. Lin et al.'s FPN addressed scale
variation by integrating multi-scale feature maps, enhancing object detection
across sizes. He et al.'s Mask R-CNN extended Faster R-CNN with instance
segmentation capabilities, broadening applications in medical imaging and
video editing. Recent advancements include YOLOv5, offering improved
performance and efficiency, and EfficientDet, achieving superior performance
with fewer parameters. DETR introduces an end-to-end Transformer-based
architecture for accurate and interpretable detections. CenterNet simplifies
detection by directly predicting object centers, suitable for resource-limited
scenarios. Cascade R-CNN iteratively refines object proposals for enhanced
detection quality, particularly in challenging scenarios. NAS-FPN automates
feature pyramid network design, enhancing performance and scalability. PANet
introduces a path aggregation network for precise instance segmentation,
improving object understanding.
Fig 2.1 illustrates how seminal works in object detection have advanced
performance and expanded applicability across diverse domains.
2.2 RELATED WORKS
The key papers surveyed and their most closely related works are summarized below:

● Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Ren, S., He, K., Girshick, R., & Sun, J., 2016, IEEE TPAMI. Related work: Fast R-CNN (Girshick, R.).
● YOLOv3: An Incremental Improvement. Redmon, J., & Farhadi, A., 2018, arXiv. Related work: YOLOv4 (Bochkovskiy, A., Wang, C. Y., & Liao, H. Y. M.).
● EfficientDet: Scalable and Efficient Object Detection. Tan, M., Pang, R., & Le, Q., 2020, CVPR. Related work: EfficientNet (Tan, M., & Le, Q.).
● DETR: End-to-End Object Detection with Transformers. Carion, N., et al., 2020, ECCV. Related work: Vision Transformers (Dosovitskiy, A., et al.).
● CenterNet: Object Detection with Center-Aspects Estimation. Mo, K., et al., 2019, CVPR. Related work: FCOS (Tian, Z., et al.).
● SSD: Single Shot MultiBox Detector. Liu, W., et al., 2016, ECCV. Related work: YOLOv4 (Bochkovskiy, A., Wang, C. Y., & Liao, H. Y. M.).
Each key area of focus for object detection systems using machine learning and deep learning approaches, illustrated in Fig 2.2, is examined in detail in the sections that follow.
2.4 LIMITATIONS
CHAPTER 3
SYSTEM DESIGN
The system architecture for an object detection system using deep learning on videos, shown in Fig 3.1, is a comprehensive framework designed to manage the intricate processes involved in detecting objects within video streams. At its core, the Data Acquisition Module serves as the system's entry point, capturing video data from sources such as cameras, video files, or live streams. Once the data is acquired, the Frame Segmentation Module extracts individual frames from the continuous video stream, employing techniques like keyframe extraction or motion-based segmentation to divide the video into manageable units. Following segmentation, the Image Pre-processing Module prepares each frame for analysis by applying a suite of transformations that enhance its quality and standardize its format. This involves resizing frames, normalizing pixel values, and reducing noise or artifacts that could impede accurate object detection.
Fig 3.1 System Architecture
● Data Acquisition
● Frame Segmentation
● Data Pre-processing
● Feature Extraction
● Classification
Data Acquisition: This step gathers video data from sources such as surveillance cameras or video files. Data augmentation techniques may be employed to enhance system robustness, for example by collecting videos from different viewpoints or under varying lighting conditions. Metadata such as timestamps may also be collected for further analysis.
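A minimal sketch of this acquisition step, assuming OpenCV (cv2) is available and using a hypothetical file name, might look like the following; the metadata fields are simply those exposed by cv2.VideoCapture:

import cv2

def acquire_video(source="surveillance_feed.mp4"):
    """Open a video file or camera stream and report basic metadata."""
    capture = cv2.VideoCapture(source)  # source may also be a device index such as 0
    if not capture.isOpened():
        raise IOError(f"Could not open video source: {source}")

    # Metadata useful for later analysis; timestamps can be derived from the fps.
    fps = capture.get(cv2.CAP_PROP_FPS)
    frame_count = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    width = int(capture.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(capture.get(cv2.CAP_PROP_FRAME_HEIGHT))
    print(f"{source}: {frame_count} frames at {fps:.1f} fps, {width}x{height}")
    return capture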
Image Pre-processing: Pre-processing prepares frames for input into the object
detection model by resizing them to a uniform size, normalizing pixel values, and
applying filters to enhance quality. This ensures the input frames are consistent and optimized for the detection model's requirements.
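As a rough illustration (not the project's exact pipeline), the pre-processing of a single frame could be sketched with OpenCV and NumPy as follows, assuming a 640x640 target size and a light Gaussian blur as the quality filter:

import cv2
import numpy as np

def preprocess_frame(frame, size=(640, 640)):
    """Resize, denoise, and normalize one frame before detection."""
    frame = cv2.resize(frame, size)             # uniform input size
    frame = cv2.GaussianBlur(frame, (3, 3), 0)  # light noise reduction
    frame = frame.astype(np.float32) / 255.0    # normalize pixel values to [0, 1]
    return frame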
3.3 DATA FLOW MODEL
The data flow diagrams below explain our work.

Level 0 of the data flow diagram, shown in Fig 3.3, presents the basic flow of the project. The input data is the COCO dataset and the result is the output data. The cleaned data is used for the detection process, after which a detailed analysis of detecting objects from video is carried out against various parameters. The process between input and output involves analysis and detection, which produces frames with the detected objects marked for the various detection purposes.
At Level 1 of the Data Flow Diagram (DFD) for an object detection system using deep learning on videos, shown in Fig 3.4, the central module is the Object Detection System (Main), which oversees the entire process. This system
interfaces with two primary components: the Input Video File and the User
Interaction Interface. The Input Video File represents the source of video data,
which could be a video file stored locally or a video stream obtained from a
camera or other sources. On the other hand, the User Interaction Interface
facilitates interaction with the system, enabling users to provide input
parameters or view the system's output. Once the video data is received, it
undergoes Frame Pre-processing, a module responsible for preparing each
frame of the video for object detection.
The Display/Save module handles the presentation or storage of the detected objects.
It may display the detected objects to the user in real-time or save them to a file
for later analysis. This evaluation provides valuable insights into the system's
performance and helps in its refinement and optimization. Overall, these
interconnected components form a cohesive system for object detection from
videos using deep learning techniques.
3.4 UML DIAGRAMS
The following UML diagrams are used to describe the project:

3.4.1 USE CASE DIAGRAM

The use case diagram, shown in Fig 3.5, involves three actors: the User, the Detection System, and the Database. The use cases associated with the User are uploading the object video and receiving the output from the detection system. The use cases associated with the Database are data storage and data management. The use cases associated with the Detection System are frame segmentation, image processing, feature extraction, and classification.
3.4.2 ACTIVITY DIAGRAM
3.4.3 SEQUENCE DIAGRAM
The sequence diagram in Fig 3.7 shows the interactions between the various components involved in the object detection process. The lifelines are the User, the Detection System, and the Database. The User uploads a video, which is stored in the database. The Detection System processes the video through frame segmentation and data pre-processing. The extracted features are then used for machine learning classification, which applies algorithms to detect the objects in the video supplied by the user. Finally, an image of the detection is shown based on the machine learning results, summarizing the findings and providing recommendations for further detection.
3.5 WORKFLOW
Step 1: Gather video analysis parameters and objectives from users to tailor the
process.
Step 2: Read the video file or stream, ensuring compatibility and smooth data
ingestion.
Step 4: Initialize the object detection model, such as YOLO, ensuring proper
configuration and optimization.
Step 5: Train the model if required, leveraging labeled datasets for fine-tuning
and improved accuracy.
Step 6: Detect objects within each pre-processed frame using the initialized
model, capturing relevant features.
Step 8: Optionally visualize detected objects overlaid on the original frame for
intuitive analysis.
Step 10: Output detected objects' labels, confidence scores, and bounding box
coordinates, facilitating interpretation and action.
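A condensed sketch of Steps 2 through 10 is shown below, assuming OpenCV for video I/O and a YOLOv5 model loaded through torch.hub as an illustrative stand-in for the project's Traced YOLOv7 weights; the file name and display window are hypothetical:

import cv2
import torch

# Illustrative stand-in for the project's YOLOv7 weights (Steps 4-5 assumed already done).
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

capture = cv2.VideoCapture("input_video.mp4")              # Step 2: read the video
while capture.isOpened():
    ok, frame = capture.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = model(rgb)                                    # Step 6: detect objects in the frame
    annotated = results.render()[0]                         # Step 8: overlay detections (optional)
    cv2.imshow("detections", cv2.cvtColor(annotated, cv2.COLOR_RGB2BGR))
    # Step 10: labels, confidence scores, and bounding-box coordinates
    print(results.pandas().xyxy[0][["name", "confidence", "xmin", "ymin", "xmax", "ymax"]])
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
capture.release()
cv2.destroyAllWindows()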
CHAPTER 4
SYSTEM IMPLEMENTATION
The seamless integration of the YOLOv7 model into the video processing
pipeline equips the system with the capability to accurately detect and classify
objects within video streams, bolstering its efficacy in domains like
surveillance, traffic monitoring, and video analytics. Through these steps, the
object detection system achieves a holistic understanding of the video content,
facilitating informed decision-making and enhancing situational awareness in
diverse scenarios.
Fig 4.1 YOLO V7 Model
Feature Extraction: The Traced YOLOv7 model employs a convolutional
neural network (CNN) backbone to automatically extract pertinent features
from pre-processed frames. These features capture critical object characteristics
such as edges, textures, and shapes.
4.3 APPROACH
The approach of the Traced YOLOv7 model, shown in Fig 4.2, involves a systematic process geared towards efficient deployment and real-time object detection in videos. Initially, the YOLOv7 model undergoes comprehensive training on a diverse dataset of annotated images and videos, where it learns to accurately detect and classify objects within the visual data through optimization of its parameters against defined loss functions.
Following training, the model is subjected to optimization techniques aimed at
streamlining its architecture for deployment on edge devices. This optimization
phase typically encompasses model quantization, pruning, and compression to
reduce its size and computational complexity while maintaining performance
integrity.
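As a hedged illustration of what such an optimization pass might look like (using PyTorch's built-in utilities on a generic model, rather than the project's exact pipeline), pruning and dynamic quantization can be sketched as follows:

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def optimize_for_edge(model: nn.Module) -> nn.Module:
    """Illustrative post-training optimization: prune convolutions, then quantize."""
    # Remove 30% of the smallest-magnitude weights in every convolutional layer.
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            prune.l1_unstructured(module, name="weight", amount=0.3)
            prune.remove(module, "weight")  # make the pruning permanent

    # Dynamic int8 quantization of linear layers reduces model size and CPU latency.
    return torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)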
Fig 4.2 Approach
4.4 ADVANTAGES OF USING TRACED YOLO V7
MODEL CLASSIFIER OVER OTHER CLASSIFIERS
Accurate Object Detection: The YOLOv7 architecture, upon which the Traced
YOLOv7 model is based, is known for its high accuracy in object detection
tasks. By leveraging advanced deep learning techniques and multi-scale feature
extraction, the model can accurately detect objects of varying sizes and aspect
ratios within video frames.
Multi-Object Detection: The Traced YOLOv7 model excels at detecting
multiple objects within a single frame simultaneously. This capability is
particularly advantageous in crowded scenes or scenarios where numerous
objects need to be detected and classified concurrently.
The steps involved in the object detection system, illustrated in Fig 4.3, are explained below.
4.5.1 DATA ACQUISITION
Data acquisition for an object detection system using deep learning from
videos is a crucial process that involves gathering video data from various
sources to create a comprehensive dataset for model training and evaluation.
Initially, the identification of suitable data sources, including surveillance
cameras, online repositories, or recorded video streams, sets the stage for
collecting the necessary footage. Prior to acquisition, it's imperative to address
legal considerations and obtain permissions if the videos contain sensitive
information. Once obtained, the videos undergo annotation and labelling to
mark the presence and location of objects within frames, a vital step for
supervised learning. Following annotation, the dataset may undergo pre-
processing tasks such as resizing, format conversion, or noise reduction to
ensure consistency and quality. Subsequently, the dataset is divided into
training, validation, and testing sets to facilitate model training and evaluation.
Additionally, data augmentation techniques may be applied to increase dataset
diversity and improve model generalization. Ultimately, well-structured storage
and management of the annotated and pre-processed video dataset are
paramount for seamless access during model development and deployment.
Through meticulous data acquisition, an object detection system can be trained
effectively, enhancing its ability to accurately identify objects within video
streams.
Representative frames are extracted while minimizing the computational load.
Adjusting the frame rate balances resources with temporal information.
Segmented frames are then organized for further processing, facilitating the
application of object detection algorithms in tasks like surveillance and
autonomous navigation.
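A simple fixed-stride sampler, sketched below with OpenCV under the assumption of a hypothetical file name, approximates this frame segmentation step; a motion- or keyframe-based test could replace the stride check:

import cv2

def sample_frames(video_path="traffic.mp4", stride=10):
    """Keep every `stride`-th frame to balance temporal coverage against compute."""
    capture = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % stride == 0:  # fixed-rate sampling; adjust stride to tune the frame rate
            frames.append(frame)
        index += 1
    capture.release()
    return frames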
4.5.4 FEATURE EXTRACTION
As frames pass through the CNN backbone, feature maps are generated at
multiple levels, encoding representations of object presence and spatial
relationships. Additionally, feature pyramid networks (FPNs) aggregate these
maps, enhancing detection across various scales and sizes.
Utilizing anchor boxes and prediction layers, the model generates bounding
boxes and confidence scores for detected objects based on extracted features,
facilitating object localization and presence indication.
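To make the idea of multi-level feature maps concrete, the sketch below pulls three stages out of a torchvision ResNet-50 backbone with create_feature_extractor; this is a generic stand-in, not the Traced YOLOv7 backbone itself:

import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# Stand-in backbone; return_nodes selects one feature map per stage (multi-scale).
backbone = resnet50(weights=None)  # pretrained weights could be loaded here
extractor = create_feature_extractor(
    backbone, return_nodes={"layer2": "p3", "layer3": "p4", "layer4": "p5"}
)

dummy_frame = torch.randn(1, 3, 640, 640)  # one pre-processed frame as a tensor
feature_maps = extractor(dummy_frame)
for name, fmap in feature_maps.items():
    print(name, tuple(fmap.shape))  # deeper levels have smaller spatial size, more channels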
Before building the model, the data is separated into two parts: training data and test data. Fig 4.4 shows a diagrammatic representation of how the data is split into training and testing sets, which are used to obtain the detection of objects from videos.
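A typical 80/20 split of the annotated frames, sketched here with scikit-learn and hypothetical file names, would look like this:

from sklearn.model_selection import train_test_split

# Hypothetical placeholders: one image/annotation pair per annotated frame.
frames = [f"frame_{i:04d}.jpg" for i in range(100)]
labels = [f"frame_{i:04d}.txt" for i in range(100)]

train_frames, test_frames, train_labels, test_labels = train_test_split(
    frames, labels, test_size=0.2, random_state=42
)
print(f"{len(train_frames)} training frames, {len(test_frames)} test frames")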
4.5.6 CLASSIFICATION
In the classification step of object detection using the Traced YOLOv7 model,
features are first extracted from pre-processed video frames, capturing spatial
information, textures, and patterns. These features are then passed through
detection heads within the Traced YOLOv7 architecture.
Utilizing anchor boxes and prediction layers, the model generates bounding
boxes enclosing detected objects, alongside confidence scores indicating object
presence likelihood. Subsequently, the classifier assigns class probabilities to
each detected object based on learned feature representations, ensuring accurate
classification into predefined categories.
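The post-processing behind this step can be illustrated with torchvision's non-maximum suppression on a few made-up predictions; the boxes, scores, and two-class probabilities below are stand-ins, not real model output:

import torch
from torchvision.ops import nms

# Stand-in predictions for one frame: [x1, y1, x2, y2] boxes with confidence scores.
boxes = torch.tensor([[100.0, 100.0, 200.0, 200.0],
                      [105.0, 98.0, 198.0, 205.0],
                      [400.0, 300.0, 480.0, 380.0]])
scores = torch.tensor([0.92, 0.85, 0.60])
class_probs = torch.tensor([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]])  # per-box class probabilities

keep = nms(boxes, scores, iou_threshold=0.5)  # suppress overlapping duplicate boxes
for i in keep:
    cls = class_probs[i].argmax().item()
    print(f"box {boxes[i].tolist()} -> class {cls}, confidence {scores[i]:.2f}")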
After the classification step with the Traced YOLOv7 model classifier,
the detected objects and their corresponding bounding boxes undergo further
evaluation using a pre-trained ResNet-101 model. ResNet-101, a deep
convolutional neural network architecture known for its exceptional
performance in various computer vision tasks, is employed to perform several
key tasks. Firstly, it recognizes and classifies the detected objects within the
video frames, leveraging its deep architecture and learned feature
representations to accurately identify objects based on visual characteristics
ResNet-101 facilitates fine-grained classification, distinguishing closely
related categories and offering detailed insights into video frame content.
Additionally, it extracts high-level features from detected objects, capturing
crucial semantic and contextual cues for comprehensive understanding.
Furthermore, ResNet-101 performs semantic segmentation, assigning each pixel
to a specific object class, enabling detailed spatial analysis. Finally, the ResNet-101 output is evaluated using metrics such as Mean Average Precision (mAP) to assess the performance of the object detection system accurately. By combining the
capabilities of Traced YOLOv7 for initial detection and ResNet-101 for detailed
evaluation, the object detection system achieves robust and reliable object
recognition in diverse video scenarios.
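A minimal sketch of this secondary stage, assuming a crop saved from a YOLOv7 bounding box and using torchvision's pretrained ResNet-101 with its standard ImageNet preprocessing (ImageNet classes stand in for the project's categories), could look like this:

import torch
from PIL import Image
from torchvision import models, transforms

# Pre-trained ResNet-101 acting as a secondary classifier over detected-object crops.
classifier = models.resnet101(weights="IMAGENET1K_V1")
classifier.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

crop = Image.open("detected_object.jpg").convert("RGB")  # hypothetical crop from a YOLOv7 box
with torch.no_grad():
    probs = classifier(preprocess(crop).unsqueeze(0)).softmax(dim=1)
top_prob, top_class = probs.max(dim=1)
print(f"class index {top_class.item()} with probability {top_prob.item():.2f}")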
CHAPTER 5
TESTING
Testing for object detection systems using deep learning from videos is a
crucial phase in assessing the performance and reliability of such systems
before deployment in real-world scenarios. This testing process involves
evaluating the model's ability to accurately detect and classify objects within
video frames, ensuring robustness and effectiveness across diverse conditions
and scenarios. By subjecting the object detection system to rigorous testing,
developers can identify potential weaknesses, optimize model parameters, and
enhance overall performance.
Our work is divided into five primary modules for testing, as shown in Table 5.1: loading the dataset, frame segmentation, image pre-processing, feature extraction, and classification. The test cases for the modules and submodules have been checked and passed. The test case ID, test case scenario, secondary considerations, and test case state are all recorded when testing our work. Test cases are validated, and the outcomes and status for each scenario are tracked.
Table 5.1 Test cases

TC ID | SCENARIO | SECONDARY CONSIDERATION | STATE
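For instance, a unit test for the image pre-processing module could be written with pytest along these lines; preprocess_frame is the hypothetical helper sketched in Chapter 3, reproduced here so the test is self-contained:

import cv2
import numpy as np

def preprocess_frame(frame, size=(640, 640)):
    """Stand-in for the image pre-processing module under test."""
    frame = cv2.resize(frame, size)
    return frame.astype(np.float32) / 255.0

def test_preprocessed_frame_shape_and_range():
    raw = np.random.randint(0, 256, size=(480, 854, 3), dtype=np.uint8)
    out = preprocess_frame(raw)
    assert out.shape == (640, 640, 3)               # uniform size expected by the detector
    assert out.dtype == np.float32
    assert out.min() >= 0.0 and out.max() <= 1.0    # normalized pixel values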
CHAPTER 6
FUTURE WORK
Some potential future directions for improving the object detection system
based on the YOLO V7 model:
APPENDIX A
SAMPLE SCREENSHOTS
Fig A.2 Upload a video
OUTPUT
Fig A.6 Output 3
APPENDIX B
SAMPLE CODING
In this appendix, we outline the code snippets employed in our
development and evaluation of an AI-powered object detection system for video
analysis. Users are guided to input essential parameters and upload videos for
object detection. Noteworthy is the incorporation of a specialized deep learning
model that allows users to interact with the system's output, enabling detailed
analysis of detected objects. Moreover, the implementation includes robust
error-handling mechanisms to ensure a smooth user experience and precise
detection results, thereby improving the system's robustness and user
satisfaction.
Yolo.py:
import argparse
import logging
import sys

import torch
import torch.nn as nn  # required for the nn.Module subclasses below
from utils.general import set_logging  # used in the __main__ section below
from utils.torch_utils import time_synchronized, fuse_conv_and_bn,
model_info, scale_img, initialize_weights, \
select_device, copy_attr
try:
    import thop  # optional dependency used for FLOPs computation
except ImportError:
    thop = None
class Detect(nn.Module):
end2end = False
include_nms = False
concat = False
a = torch.tensor(anchors).float().view(self.nl, -1, 2)
self.register_buffer('anchors', a) # shape(nl,na,2)
self.register_buffer('anchor_grid', a.clone().view(self.nl, 1, -1, 1, 1, 2)) #
shape(nl,1,na,1,1,2)
z = [] # inference output
self.training |= self.export
for i in range(self.nl):
if self.grid[i].shape[2:4] != x[i].shape[2:4]:
y = x[i].sigmoid()
if not torch.onnx.is_in_onnx_export():
include_nms = False
concat = False
a = torch.tensor(anchors).float().view(self.nl, -1, 2)
self.register_buffer('anchors', a) # shape(nl,na,2)
self.register_buffer('anchor_grid', a.clone().view(self.nl, 1, -1, 1, 1, 2)) #
shape(nl,1,na,1,1,2)
z = [] # inference output
self.training |= self.export
x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4,
2).contiguous()
if self.grid[i].shape[2:4] != x[i].shape[2:4]:
y = x[i].sigmoid()
if not torch.onnx.is_in_onnx_export():
else:
wh = wh ** 2 * (4 * self.anchor_grid[i].data) # new wh
if self.training:
out = x
elif self.end2end:
out = torch.cat(z, 1)
args[1] = [list(range(args[1] * 2))] * len(f)
elif m is ReOrg:
c2 = ch[f] * 4
elif m is Contract:
c2 = ch[f] * args[0] ** 2
elif m is Expand:
c2 = ch[f] // args[0] ** 2
else:
c2 = ch[f]
layers.append(m_)
if i == 0:
ch = []
ch.append(c2)
parser = argparse.ArgumentParser()
parser.add_argument('--cfg', type=str, default='yolov7.yaml', help='model config yaml (default assumed)')
parser.add_argument('--device', default='', help='cuda device, e.g. 0, or cpu')
parser.add_argument('--profile', action='store_true', help='profile model speed')
opt = parser.parse_args()
set_logging()
device = select_device(opt.device)
# Create model
model = Model(opt.cfg).to(device)
model.train()
if opt.profile:
    img = torch.rand(1, 3, 640, 640).to(device)  # dummy input for speed profiling
    y = model(img, profile=True)
APPENDIX C
SYSTEM REQUIREMENTS
HARDWARE SPECIFICATION
SOFTWARE SPECIFICATION
● Windows 10 or Higher
● Visual Studio Code
BROWSERS
● Chrome
● Edge
● Mozilla Firefox
● Internet Explorer
● Safari
DATASETS
● COCO