
CHAPTER 2

LITERATURE SURVEY

Object detection in images and videos is a fundamental task in computer vision with numerous practical applications ranging from autonomous driving
and surveillance to healthcare and augmented reality. Over the years, significant
progress has been made in the development of object detection systems, driven
by advancements in machine learning and deep learning approaches. These
approaches have revolutionized the field by enabling automated and accurate
detection of objects within visual data. In this literature survey, we review the
state-of-the-art techniques and methodologies in object detection, with a focus
on machine learning and deep learning approaches. We explore the evolution of
object detection algorithms, the challenges faced in the field, and the recent
advancements that have propelled the performance of object detection systems
to unprecedented levels. By synthesizing the existing body of research, we aim
to provide insights into the current landscape of object detection and identify
potential avenues for future research and development.

2.1 LITERATURE SURVEY WORKS

The evolution of object detection in images and videos has been shaped
by groundbreaking research in machine learning and deep learning. Ren et al.'s
Faster R-CNN introduced Region Proposal Networks (RPNs), streamlining
detection and enabling real-time performance. Redmon et al.'s YOLO reframed
detection as a regression problem, reducing computational complexity and
enhancing speed, ideal for applications like video surveillance. Liu et al.'s SSD
introduced single-stage detection, improving efficiency and enabling
deployment on resource-constrained devices. Lin et al.'s FPN addressed scale
variation by integrating multi-scale feature maps, enhancing object detection
across sizes. He et al.'s Mask R-CNN extended Faster R-CNN with instance
segmentation capabilities, broadening applications in medical imaging and
video editing. Recent advancements include YOLOv5, offering improved
performance and efficiency, and EfficientDet, achieving superior performance
with fewer parameters. DETR introduces an end-to-end Transformer-based
architecture for accurate and interpretable detections. CenterNet simplifies
detection by directly predicting object centers, suitable for resource-limited
scenarios. Cascade R-CNN iteratively refines object proposals for enhanced
detection quality, particularly in challenging scenarios. NAS-FPN automates
feature pyramid network design, enhancing performance and scalability. PANet
introduces a path aggregation network for precise instance segmentation,
improving object understanding.

Fig 2.1 illustrates how seminal works in object detection have advanced
performance and expanded applicability across diverse domains.

Fig 2.1 Existing system

2.2 RELATED WORKS

Literature Survey | Authors | Year | Venue | Related Works
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks | Ren, S., He, K., Girshick, R., & Sun, J. | 2016 | IEEE TPAMI | Fast R-CNN: Girshick, R.
YOLOv3: An Incremental Improvement | Redmon, J., & Farhadi, A. | 2018 | arXiv | YOLOv4: Bochkovskiy, A., Wang, C. Y., & Liao, H. Y. M.
EfficientDet: Scalable and Efficient Object Detection | Tan, M., Pang, R., & Le, Q. | 2020 | CVPR | EfficientNet: Tan, M., & Le, Q.
DETR: End-to-End Object Detection with Transformers | Carion, N., et al. | 2020 | ECCV | Vision Transformers: Dosovitskiy, A., et al.
CenterNet: Object Detection with Center-Aspects Estimation | Mo, K., et al. | 2019 | CVPR | FCOS: Tian, Z., et al.
Mask R-CNN | He, K., et al. | 2017 | ICCV | PANet: Liu, S., et al.
SSD: Single Shot MultiBox Detector | Liu, W., et al. | 2016 | ECCV | YOLOv4: Bochkovskiy, A., Wang, C. Y., & Liao, H. Y. M.
RetinaNet: Focal Loss for Dense Object Detection | Lin, T. Y., et al. | 2017 | ICCV | DSSD: Fu, C. Y., et al.

Table 2.1 Related works


2.3 KEY AREAS OF FOCUS

Let us delve deeper into each key area of focus for object detection systems using machine learning and deep learning approaches, as shown in Fig 2.2:

Model Architectures: Object detection research delves into designing networks that balance accuracy and efficiency by exploring various depths, widths, and
connectivity patterns. Architectural innovations like skip connections and
attention mechanisms optimize performance for challenges such as scale
variation and occlusion.

Data Collection and Annotation: Researchers emphasize collecting diverse datasets covering various object categories, poses, and lighting conditions.
Augmentation techniques like rotation and flipping increase dataset diversity,
while precise annotation with bounding boxes ensures model efficacy.

Training Strategies: Effective strategies like transfer learning and semi-supervised learning optimize model performance and minimize training time.
Techniques such as hyperparameter tuning and adaptive learning rates play
pivotal roles in enhancing training effectiveness.

Feature Representation: Feature extraction methods, including CNNs and FPNs, capture hierarchical representations from input images effectively.
Attention mechanisms enhance feature representation by focusing on
informative regions, while spatial transformers improve localization accuracy.

Object Localization: Anchor-based methods utilize predefined anchor boxes for accurate object localization, while anchor-free methods simplify the process by directly regressing object centroids. Techniques for handling scale variation and occlusion enhance localization performance in challenging scenarios.

Inference Speed and Efficiency: Real-time object detection systems demand swift and efficient inference algorithms to adhere to strict latency requirements.
Techniques like network pruning, quantization, and model distillation reduce
model complexity without compromising accuracy. Hardware accelerators such
as GPUs and TPUs leverage parallel processing to expedite inference, while
distributed techniques further enhance efficiency.
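To make these efficiency techniques concrete, a minimal sketch of post-training dynamic quantization and magnitude pruning in PyTorch is given below; the toy network and the 30% pruning ratio are illustrative assumptions rather than settings used in this system.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for a detection network (illustrative only).
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 80),
)

# Post-training dynamic quantization: Linear weights are stored as int8
# and activations are quantized on the fly, shrinking the model and
# speeding up CPU inference.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Unstructured magnitude pruning: zero out the 30% smallest weights of
# the first convolution to reduce effective model complexity.
prune.l1_unstructured(model[0], name="weight", amount=0.3)
print(quantized)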

Evaluation Metrics and Benchmarking: Standardized metrics like precision, recall, mAP, and IoU, along with benchmark datasets like COCO and PASCAL
VOC, are essential for objectively evaluating object detection models. These
metrics ensure the development of reliable systems that meet the evolving
demands of diverse applications by driving advancements in the field.
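As a concrete reference for one of these metrics, the short function below computes the IoU of two axis-aligned bounding boxes; the example boxes and the 0.5 true-positive threshold mentioned in the comment are illustrative values only.

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A predicted box overlapping a ground-truth box; IoU >= 0.5 is a
# common threshold for counting the prediction as a true positive.
print(iou((10, 10, 50, 50), (20, 20, 60, 60)))  # ~0.39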

Fig 2.2 Sample object detection system

2.4 LIMITATIONS

Data Dependency: Object detection models rely heavily on annotated training data, which can be costly and time-consuming to collect. Moreover, inadequate representation of the target domain in the training data can hinder real-world performance.
Computationally Intensive: Deep learning models for object detection require
significant computational resources during training and inference, posing
challenges for deployment on resource-constrained devices.
Overfitting: Deep learning models are prone to overfitting, memorizing noise or
irrelevant patterns in training data. Techniques like dropout and weight decay
mitigate overfitting, but balancing model capacity and dataset size remains
challenging.
Difficulty with Small Objects and Occlusions: Object detection systems may
struggle with detecting small objects or those partially occluded by other objects,
impacting localization and classification accuracy.
Limited Interpretability: Deep learning models are often considered "black
box," making it difficult to interpret their decisions, especially in safety-critical
applications.
Domain Adaptation: Models trained on one dataset may not generalize well to
new environments. Techniques for domain adaptation aim to transfer knowledge
to new domains, but challenges remain in adapting to diverse environments.
Runtime Performance: Real-time object detection systems require fast
inference speeds while maintaining accuracy, necessitating optimization
techniques and hardware acceleration.
Handling Imbalanced Data: Imbalanced datasets can bias models, affecting
performance on minority classes. Techniques like class re-weighting and data
augmentation address class imbalance issues during training.
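A minimal sketch of class re-weighting is shown below, assuming hypothetical class counts; the inverse-frequency weights are passed to a weighted cross-entropy loss so that rare classes contribute more to training.

import torch
import torch.nn as nn

# Hypothetical class frequencies for a three-class detection dataset
# (e.g. "car" is common, "bicycle" is rare).
class_counts = torch.tensor([9000.0, 800.0, 200.0])

# Inverse-frequency weights: rare classes contribute more to the loss,
# counteracting the bias toward majority classes.
weights = class_counts.sum() / (len(class_counts) * class_counts)

criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(4, 3)            # dummy classifier outputs
targets = torch.tensor([0, 2, 1, 2])  # dummy labels
loss = criterion(logits, targets)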

CHAPTER 3

SYSTEM DESIGN

3.1 SYSTEM ARCHITECTURE

The system architecture shown in Fig 3.1 for an object detection system
using deep learning from videos is a comprehensive framework designed to
manage the intricate processes involved in detecting objects within video
streams. At its core, the Data Acquisition Module serves as the system's entry
point, facilitating the capture of video data from various sources such as
cameras, video files, or live streams. Once acquired, the Frame Segmentation
Module meticulously extracts individual frames from the continuous video
stream, employing techniques like keyframe extraction or motion-based
segmentation to efficiently segment the video into manageable units. Following
segmentation, the Image Pre-processing Module meticulously prepares each
frame for analysis by applying a suite of transformations to enhance its quality
and standardize its format. This involves resizing frames, normalizing pixel
values, and reducing noise or artifacts that could impede accurate object
detection.

Subsequently, the Feature Extraction Module employs sophisticated deep learning techniques, particularly convolutional neural networks (CNNs), to
automatically extract salient features from the pre-processed frames. These
features capture both low-level details like edges and textures and high-level
features related to object shapes and structures, enabling precise detection.
Finally, the Classification Module assigns labels or categories to the extracted
features, indicating the presence of specific objects within the frames. By
integrating these modules into a coherent architecture, the system can efficiently
process video data and accurately detect objects within video streams, paving
the way for a wide range of applications across domains such as surveillance, traffic monitoring, and video analytics.
Fig 3.1 System Architecture

3.2 PROCESS WORKFLOW

Computer-aided detection methods are more efficient, reliable, accurate, and less time-consuming than manual detection methods. The process flow includes the following steps, explained in Fig 3.2:

● Data Acquisition
● Frame Segmentation
● Data Pre-processing
● Feature Extraction
● Classification

Data Acquisition: This step involves gathering video data from various sources
like surveillance cameras or video files. Data augmentation techniques may be
employed to enhance system robustness by collecting videos from different
viewpoints or lighting conditions. Metadata such as timestamps may also be
collected for further analysis.

Frame Segmentation: Frame segmentation divides the continuous video stream into individual frames, typically represented as images. Techniques may include
frame extraction at fixed intervals or advanced methods like keyframe selection
based on scene changes or motion detection algorithms.

Image Pre-processing: Pre-processing prepares frames for input into the object
detection model by resizing them to a uniform size, normalizing pixel values, and
applying filters to enhance quality. This ensures consistency and optimization of
input frames for the detection model's requirements.

Feature Extraction: Convolutional neural networks (CNNs) automatically learn and extract relevant features from pre-processed frames. These networks capture
hierarchical representations of input data, enabling the system to represent objects
in video frames accurately.

Classification: This step assigns labels to extracted features, indicating the presence of specific objects within frames. A classification model, such as a SoftMax classifier, computes the probability distribution over the different classes, and the class with the highest probability is assigned to each detected object.

3.3 DATA FLOW MODEL

The data flow diagrams represented below explain our work.

3.3.1 DFD LEVEL 0

Level 0 of the data flow diagram, shown in Fig 3.3, presents the basic flow of the project. The input data contains the COCO database and the result is the output data. The cleaned data is used for the detection process. A detailed analysis of detecting objects from video is then made on various parameters. The process in between involves analysis and detection, which results in annotated frames used to detect the objects.

Fig 3.3 DFD Level 0

3.3.2 DFD LEVEL 1

At Level 1 of the Data Flow Diagram (DFD), shown in Fig 3.4, for an object detection system using deep learning from videos, the central module is

Object Detection System (Main), which oversees the entire process. This system
interfaces with two primary components: the Input Video File and the User
Interaction Interface. The Input Video File represents the source of video data,
which could be a video file stored locally or a video stream obtained from a
camera or other sources. On the other hand, the User Interaction Interface
facilitates interaction with the system, enabling users to provide input
parameters or view the system's output. Once the video data is received, it
undergoes Frame Pre-processing, a module responsible for preparing each
frame of the video for object detection.

Fig 3.4 DFD Level 1

This process involves tasks such as resizing, normalization, or applying other transformations to enhance the quality of input frames for subsequent
analysis. After pre-processing, the frames are fed into the Object Detection
module, where deep learning techniques like YOLO (You Only Look Once) are
employed to detect objects present within each frame.

The Display/Save module handles the presentation or storage of the detected objects. It may display the detected objects to the user in real time or save them to a file for later analysis. An optional performance evaluation stage provides valuable insights into the system's performance and helps in its refinement and optimization. Overall, these interconnected components form a cohesive system for object detection from videos using deep learning techniques.

3.3.3 DFD LEVEL 2

At Level 2 of the Data Flow Diagram (DFD) for an object detection system using deep learning from videos, each module introduced at Level 1 is
expanded to detail its internal processes and interactions. The Input Video File
Module remains the source of video data, while the User Interaction Interface
Module continues to facilitate user interaction with the system. Within the
Frame Pre-processing Module, specific tasks such as resizing, normalization,
and metadata extraction from each frame are outlined to optimize frame quality.
The Object Detection Module is further elaborated to include the deep learning
model architecture employed, detailing the process of loading the pre-trained
model, executing object detection on each frame, and extracting relevant object
information. The Output Display/Save Module expands to specify supported
output formats and may include post-processing tasks for refining detected
object information. Additionally, the optional Performance Evaluation Module
may be detailed further to specify evaluation metrics and the comparison of
ground truth annotations with detected objects for quantitative analysis.
Interactions among these modules are delineated, illustrating the flow of data
and control within the system, including feedback loops for iterative refinement
and optimization. Overall, Level 2 of the DFD provides a comprehensive
blueprint of the internal workings and interactions of the object detection
system, enhancing understanding and facilitating system development and
refinement.

3.4 UML DIAGRAMS

The following UML diagrams are used to describe the project:

• Use case diagram
• Activity diagram
• Sequence diagram

3.4.1 USE CASE DIAGRAM

There are three actors involved in the use case diagram shown in Fig 3.5: the User, the Detection System, and the Database. The use cases associated with the User are uploading the object video and receiving the output from the detection system. The use cases associated with the Database are data storage and data management. The use cases associated with the Detection System are frame segmentation, image processing, feature extraction, and classification.

Fig 3.5 Use case Diagram

3.4.2 ACTIVITY DIAGRAM

The activity diagram shown in Fig 3.6 is a flowchart that represents the flow from one activity to another. It helps to visualize the overall workflow of the object detection process and highlights the key steps involved in detecting objects using deep learning. The diagram can be used to guide the development of a software system that supports the object detection process and helps to improve the accuracy and efficiency of detection. The process flow starts with input data that contains object images, from which the required features are extracted for effective modelling.

Fig 3.6 Activity Diagram

3.4.3 SEQUENCE DIAGRAM

The sequence diagram in Fig 3.7 shows the interactions between the various components involved in the object detection process. The lifelines are the User, the Detection System, and the Database. The User uploads the video, which is stored in the database. The system processes the video file through frame segmentation and data pre-processing. The extracted features are then used for machine learning classification, which applies algorithms to detect the objects in the video supplied by the user. Finally, an image of the detection is shown based on the machine learning results, summarizing the findings and providing recommendations for further detection.

Fig 3.7 Sequence Diagram

3.5 WORKFLOW

Step 1: Gather video analysis parameters and objectives from users to tailor the
process.

Step 2: Read the video file or stream, ensuring compatibility and smooth data
ingestion.

Step 3: Pre-process frames by applying resizing, normalization, and other transformations to enhance model performance.

Step 4: Initialize the object detection model, such as YOLO, ensuring proper
configuration and optimization.

Step 5: Train the model if required, leveraging labeled datasets for fine-tuning
and improved accuracy.

Step 6: Detect objects within each pre-processed frame using the initialized
model, capturing relevant features.

Step 7: Perform post-processing on detected objects, filtering out low-confidence detections and redundant bounding boxes.

Step 8: Optionally visualize detected objects overlaid on the original frame for
intuitive analysis.

Step 9: Aggregate detected objects if tracking or temporal analysis is needed, ensuring analysis coherence.

Step 10: Output detected objects' labels, confidence scores, and bounding box
coordinates, facilitating interpretation and action.
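The minimal sketch below shows how Steps 2 to 10 might be wired together with OpenCV and a YOLOv7 checkpoint loaded through torch.hub; the repository name, weight file, video path, and the 0.25 confidence threshold are assumptions for illustration, not settings prescribed by this project.

import cv2
import torch

# Hypothetical hub entry point and weight file for a pretrained YOLOv7;
# adjust to the actual repository and checkpoint used in the project.
model = torch.hub.load('WongKinYiu/yolov7', 'custom', 'yolov7.pt')
model.eval()

cap = cv2.VideoCapture('input_video.mp4')             # Step 2: read the video
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)       # Step 3: pre-process
    results = model(rgb)                               # Steps 4 and 6: detect
    # Steps 7 and 10: each row is (x1, y1, x2, y2, confidence, class),
    # assuming the hub wrapper returns a Detections object with .xyxy.
    for *box, conf, cls in results.xyxy[0].tolist():
        if conf < 0.25:                                # drop low-confidence hits
            continue
        x1, y1, x2, y2 = map(int, box)
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)  # Step 8
    cv2.imshow('detections', frame)                    # visualize per frame
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()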

CHAPTER 4

SYSTEM IMPLEMENTATION

4.1 TRACED YOLO V7 MODEL

Implementing a YOLOv7 model, shown in Fig 4.1, for object detection in videos entails a systematic approach beginning with data acquisition. Video
data, sourced from cameras, files, or streams, undergoes segmentation into
individual frames, representing discrete moments within the video sequence.
These frames undergo image pre-processing, where techniques like resizing,
normalization, and noise reduction are applied to standardize their format and
enhance their quality, optimizing them for subsequent analysis. The crux of the
YOLOv7 model lies in feature extraction, achieved through a convolutional
neural network (CNN) backbone.

This backbone network extracts hierarchical features from the frames,


encompassing both fine-grained details and high-level semantics crucial for
precise object detection. Simultaneously, the model conducts classification,
wherein feature maps are processed by detection heads to predict bounding
boxes and class probabilities for objects within the frames. This integrated
approach enables YOLOv7 to efficiently detect objects across various scales
and aspect ratios, ensuring robust performance in real-time applications.

The seamless integration of the YOLOv7 model into the video processing
pipeline equips the system with the capability to accurately detect and classify
objects within video streams, bolstering its efficacy in domains like
surveillance, traffic monitoring, and video analytics. Through these steps, the
object detection system achieves a holistic understanding of the video content,
facilitating informed decision-making and enhancing situational awareness in
diverse scenarios.

Fig 4.1 YOLO V7 Model

4.2 WORKING PRINCIPLES OF TRACED YOLO V7 MODEL

The working principles of the Traced YOLOv7 model for object detection in videos align with the specified steps of the object detection system using deep learning from videos:

Data Acquisition: Initially, video data is obtained from diverse sources


like cameras or video files, serving as the input for object detection. This step is
crucial for identifying objects within the video frames accurately.

Frame Segmentation: Following data acquisition, the video stream undergoes


segmentation into individual frames. Each frame represents a distinct snapshot of
the video, enabling independent analysis for object detection purposes.

Image Pre-processing: Before inputting frames into the Traced YOLOv7


model, pre-processing steps are executed to improve their quality and usability.
Tasks may include resizing frames, normalizing pixel values, and reducing noise
to enhance object detection accuracy.

Feature Extraction: The Traced YOLOv7 model employs a convolutional
neural network (CNN) backbone to automatically extract pertinent features
from pre-processed frames. These features capture critical object characteristics
such as edges, textures, and shapes.

Classification: Alongside object detection, the Traced YOLOv7 model performs


classification, predicting bounding boxes and class probabilities for detected
objects within frames. This enables accurate identification and classification of
objects in the video stream.

4.3 APPROACH

The approach of the Traced YOLOv7 model, shown in Fig 4.2, involves a systematic process geared towards enabling efficient deployment and real-time object detection in videos. Initially, the YOLOv7 model undergoes comprehensive training on a diverse dataset of annotated images and videos, where it learns to accurately detect and classify objects within the visual data through optimization of its parameters based on defined loss functions.
Following training, the model is subjected to optimization techniques aimed at
streamlining its architecture for deployment on edge devices. This optimization
phase typically encompasses model quantization, pruning, and compression to
reduce its size and computational complexity while maintaining performance
integrity.

Fig 4.2 Approach

Following optimization, the YOLOv7 model undergoes tracing to


TensorFlow Lite for edge device compatibility, enabling real-time object
detection. Integrated into video processing pipelines, it efficiently operates on
individual frames, performing preprocessing, feature extraction, and
classification for object detection. With its optimized design and edge device
integration, the Traced YOLOv7 model ensures low-latency inference, crucial
for applications such as surveillance, video analytics, and autonomous vehicles.
Finally, the model is deployed on edge devices, where its lightweight and
efficient architecture make it well-suited for resource-constrained environments,
ensuring accurate and timely object detection in real-world scenarios. Through
this methodical approach, the Traced YOLOv7 model empowers various
applications with robust and efficient object detection capabilities in video data.
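As one illustrative possibility for the tracing step, the sketch below applies torch.jit.trace to a stand-in PyTorch network; in practice the loaded YOLOv7 model would be traced, and any conversion towards TensorFlow Lite (for example via ONNX) would be a separate export step. The stand-in layers, input resolution, and file names are assumptions.

import torch
import torch.nn as nn

# Stand-in network; in practice this would be the trained YOLOv7 model.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1),
    nn.SiLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1),
    nn.SiLU(),
)
model.eval()

example = torch.rand(1, 3, 640, 640)   # matches the inference resolution

# Tracing records the operations executed on the example input and
# freezes them into a static, serializable graph; converting that graph
# for an edge runtime is a later, separate step.
traced = torch.jit.trace(model, example)
traced.save('yolov7_traced.pt')

# The traced module can be reloaded without the original class definitions.
restored = torch.jit.load('yolov7_traced.pt')
with torch.no_grad():
    output = restored(example)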

4.4 ADVANTAGES OF USING TRACED YOLO V7
MODEL CLASSIFIER OVER OTHER CLASSIFIERS

The Traced YOLOv7 model classifier offers several advantages over


other classifiers, making it a preferred choice for object detection tasks in
videos:

Real-Time Performance: The Traced YOLOv7 model is optimized for real-time performance, enabling it to process video frames and detect objects with
low latency. This real-time capability is crucial for applications such as
surveillance, where timely detection of objects is essential.

Efficient Deployment on Edge Devices: The Traced YOLOv7 model is


designed for deployment on edge devices such as mobile phones, drones, or
embedded systems. Its lightweight architecture and efficient inference make it
well-suited for resource-constrained environments, allowing for on-device
processing without reliance on cloud servers.

Accurate Object Detection: The YOLOv7 architecture, upon which the Traced
YOLOv7 model is based, is known for its high accuracy in object detection
tasks. By leveraging advanced deep learning techniques and multi-scale feature
extraction, the model can accurately detect objects of varying sizes and aspect
ratios within video frames.

Single Stage Detection: Unlike traditional two-stage detectors, such as Faster R-CNN, which require separate region proposal and object classification stages, YOLOv7 performs object detection in a single stage. This simplifies the detection pipeline and reduces inference time, resulting in faster processing of video data.

Multi-Object Detection: The Traced YOLOv7 model excels at detecting
multiple objects within a single frame simultaneously. This capability is
particularly advantageous in crowded scenes or scenarios where numerous
objects need to be detected and classified concurrently.

End-to-End Training: YOLOv7 models, including the Traced variant, can be


trained end-to-end on large datasets, facilitating seamless integration of domain-
specific features, and improving overall detection performance.

Flexibility and Customization: The Traced YOLOv7 model offers flexibility


and customization options, allowing developers to fine-tune model parameters,
optimize performance, and adapt to specific application requirements.

4.5 STEPS INVOLVED IN THE OBJECT DETECTION SYSTEM

The steps involved in the object detection system, explained in Fig 4.3, are:

Fig 4.3 Object Detection System

4.5.1 DATA ACQUISITION

Data acquisition for an object detection system using deep learning from
videos is a crucial process that involves gathering video data from various
sources to create a comprehensive dataset for model training and evaluation.
Initially, the identification of suitable data sources, including surveillance
cameras, online repositories, or recorded video streams, sets the stage for
collecting the necessary footage. Prior to acquisition, it's imperative to address
legal considerations and obtain permissions if the videos contain sensitive
information. Once obtained, the videos undergo annotation and labelling to
mark the presence and location of objects within frames, a vital step for
supervised learning. Following annotation, the dataset may undergo pre-
processing tasks such as resizing, format conversion, or noise reduction to
ensure consistency and quality. Subsequently, the dataset is divided into
training, validation, and testing sets to facilitate model training and evaluation.
Additionally, data augmentation techniques may be applied to increase dataset
diversity and improve model generalization. Ultimately, well-structured storage
and management of the annotated and pre-processed video dataset are
paramount for seamless access during model development and deployment.
Through meticulous data acquisition, an object detection system can be trained
effectively, enhancing its ability to accurately identify objects within video
streams.

4.5.2 FRAME SEGMENTATION

Frame segmentation, a key step in video-based object detection systems,


involves dividing a continuous video stream into individual frames, each
representing a distinct moment for analysis. This process enables efficient
object detection by allowing the model to focus on one frame at a time.
Techniques like uniform sampling or motion-based segmentation ensure

representative frames are extracted while minimizing computational load.
Adjusting the frame rate balances resources with temporal information.
Segmented frames are then organized for further processing, facilitating the
application of object detection algorithms in tasks like surveillance and
autonomous navigation.
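A minimal fixed-interval sampling sketch using OpenCV is given below; the video filename and the sampling interval are placeholder values.

import cv2

def extract_frames(video_path, every_n=10):
    """Uniformly sample one frame out of every `every_n` from a video."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:      # fixed-interval sampling
            frames.append(frame)
        index += 1
    cap.release()
    return frames

# Hypothetical input file; every_n trades temporal detail for compute.
frames = extract_frames('surveillance_clip.mp4', every_n=15)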

4.5.3 IMAGE PRE-PROCESSING

Image pre-processing is a critical step in preparing video frames for


object detection using deep learning techniques. It involves a series of
operations aimed at enhancing the quality of the frames and standardizing their
format to optimize subsequent processing. In the context of object detection
systems from videos, image pre-processing typically includes the following
steps:

Resizing frames to a uniform size ensures consistency and efficient


processing by the deep learning model, preventing issues from varying aspect
ratios or resolutions. Additionally, normalization standardizes pixel values
across frames, stabilizing the training process by bringing values within a
specific range, typically 0 to 1 or -1 to 1. This consistency ensures consistent
weight updates during training, optimizing model performance.

Additionally, reducing noise and artifacts in the frames is crucial for improving the quality of input data. Techniques such as denoising filters or image smoothing algorithms may be applied to remove unwanted noise and enhance the clarity of objects in the frames. Furthermore, adjusting the color space or applying color corrections may be necessary to ensure consistency in color representation across frames. This step helps in mitigating variations in lighting conditions or camera settings, which can affect the performance of the object detection model. Finally, data augmentation techniques such as rotation, flipping, or adding random perturbations may be applied to augment the dataset.
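The sketch below illustrates these pre-processing and augmentation steps on a single frame with OpenCV and NumPy; the 640x640 target size, the Gaussian kernel, and the stand-in frame are assumptions chosen for demonstration.

import cv2
import numpy as np

def preprocess(frame, size=(640, 640)):
    """Resize, denoise and normalize one BGR frame for the detector."""
    frame = cv2.resize(frame, size)                # uniform input size
    frame = cv2.GaussianBlur(frame, (3, 3), 0)     # light noise reduction
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)   # consistent colour space
    scaled = rgb.astype(np.float32) / 255.0        # normalize to [0, 1]
    return np.transpose(scaled, (2, 0, 1))         # HWC -> CHW for the model

# Simple augmentation examples on a placeholder frame.
frame = np.zeros((480, 640, 3), dtype=np.uint8)    # stand-in for a real frame
flipped = cv2.flip(frame, 1)                       # horizontal flip
rotated = cv2.rotate(frame, cv2.ROTATE_90_CLOCKWISE)
tensor = preprocess(frame)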

4.5.4 FEATURE EXTRACTION

Feature extraction in an object detection system with the Traced YOLOv7


model involves several key steps. Firstly, pre-processed video frames are
inputted into the model, which acts as both a feature extractor and a detector.
Leveraging a convolutional neural network (CNN), the model extracts
hierarchical features capturing spatial information, textures, and patterns.

As frames pass through the CNN backbone, feature maps are generated at
multiple levels, encoding representations of object presence and spatial
relationships. Additionally, feature pyramid networks (FPNs) aggregate these
maps, enhancing detection across various scales and sizes.

Utilizing anchor boxes and prediction layers, the model generates bounding
boxes and confidence scores for detected objects based on extracted features,
facilitating object localization and presence indication.

Moreover, temporal information from consecutive frames can be integrated using


recurrent neural networks (RNNs) or 3D convolutional networks, enabling
motion pattern recognition and improved object tracking within videos.
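To make the idea of multi-level feature maps concrete, the sketch below extracts intermediate feature maps from a ResNet-50 backbone using torchvision's feature-extraction utility (assuming a recent torchvision); the actual system relies on the YOLOv7 backbone, so the backbone choice and layer names here are illustrative substitutes.

import torch
import torchvision
from torchvision.models.feature_extraction import create_feature_extractor

# ResNet-50 stands in for the YOLOv7 backbone; the layer names below
# are specific to ResNet and chosen to mimic FPN levels P3-P5.
backbone = torchvision.models.resnet50(weights=None)
extractor = create_feature_extractor(
    backbone, return_nodes={'layer2': 'p3', 'layer3': 'p4', 'layer4': 'p5'}
)

frame_batch = torch.rand(1, 3, 640, 640)       # one pre-processed frame
features = extractor(frame_batch)
for name, fmap in features.items():
    # Deeper levels trade spatial resolution for richer semantics,
    # which is what an FPN-style neck later aggregates.
    print(name, tuple(fmap.shape))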

4.5.5 SPLITTING THE DATA INTO TRAINING AND TESTING DATASETS

Before building the model, the data is separated into two parts: training data and test data. Fig 4.4 shows a diagrammatic representation of how the data is split into training and testing sets, which are used to obtain the detection of objects from videos.
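A minimal sketch of such a split using scikit-learn is given below; the 80/20 ratio, file-naming scheme, and dataset size are placeholders.

from sklearn.model_selection import train_test_split

# Hypothetical lists of annotated frame paths and their label files.
frame_paths = [f'frames/frame_{i:05d}.jpg' for i in range(1000)]
label_paths = [f'labels/frame_{i:05d}.txt' for i in range(1000)]

# 80/20 split; a fixed random_state keeps the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    frame_paths, label_paths, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 800 200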

4.5.6 CLASSIFICATION

In the classification step of object detection using the Traced YOLOv7 model,
features are first extracted from pre-processed video frames, capturing spatial
information, textures, and patterns. These features are then passed through
detection heads within the Traced YOLOv7 architecture.

Utilizing anchor boxes and prediction layers, the model generates bounding
boxes enclosing detected objects, alongside confidence scores indicating object
presence likelihood. Subsequently, the classifier assigns class probabilities to
each detected object based on learned feature representations, ensuring accurate
classification into predefined categories.

To refine detections, post-processing techniques like non-maximum suppression


(NMS) may be employed to filter redundant bounding boxes. Additionally, the
classifier can utilize temporal information from consecutive frames to enhance
classification accuracy by considering motion patterns.
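A small sketch of non-maximum suppression with torchvision is shown below; the boxes, scores, and the 0.5 IoU threshold are made-up values used only to demonstrate the call.

import torch
from torchvision.ops import nms

# Hypothetical raw detections for one frame: boxes as (x1, y1, x2, y2).
boxes = torch.tensor([[100., 100., 210., 240.],
                      [105., 102., 215., 245.],   # near-duplicate of the first
                      [300., 120., 380., 220.]])
scores = torch.tensor([0.92, 0.85, 0.70])

# Keep the highest-scoring box of any overlapping cluster (IoU > 0.5).
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]) -- the duplicate box is suppressed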

Following this classification process, model evaluation proceeds with ResNet-101, a deep neural network architecture, for further assessment and refinement.

4.6 EVALUATION METRICS

After the classification step with the Traced YOLOv7 model classifier,
the detected objects and their corresponding bounding boxes undergo further
evaluation using a pre-trained ResNet-101 model. ResNet-101, a deep
convolutional neural network architecture known for its exceptional
performance in various computer vision tasks, is employed to perform several
key tasks. Firstly, it recognizes and classifies the detected objects within the
video frames, leveraging its deep architecture and learned feature
representations to accurately identify objects based on visual characteristics

ResNet-101 facilitates fine-grained classification, distinguishing closely
related categories and offering detailed insights into video frame content.
Additionally, it extracts high-level features from detected objects, capturing
crucial semantic and contextual cues for comprehensive understanding.
Furthermore, ResNet-101 performs semantic segmentation, assigning each pixel
to a specific object class, enabling detailed spatial analysis. Finally, evaluation
of the ResNet-101 output utilizes metrics like Mean Average Precision (mAP)
to assess the performance of the object detection system accurately. By combining the
capabilities of Traced YOLOv7 for initial detection and ResNet-101 for detailed
evaluation, the object detection system achieves robust and reliable object
recognition in diverse video scenarios.
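For COCO-style annotations, mAP can be computed with pycocotools as sketched below; the annotation and detection file paths are assumptions about the project layout rather than files produced by this report.

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Assumed file layout: COCO-format ground-truth annotations and a JSON of
# detections [{"image_id", "category_id", "bbox", "score"}, ...] produced
# by the detector; the paths are illustrative.
coco_gt = COCO('annotations/instances_val.json')
coco_dt = coco_gt.loadRes('detections.json')

evaluator = COCOeval(coco_gt, coco_dt, iouType='bbox')
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP / AP50 / AP75 and the overall mAP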

4.7 LIMITATIONS OF YOLO V7


YOLO v7 is a powerful and effective object detection algorithm, but it does
have a few limitations.

• YOLO v7, like many object detection algorithms, struggles to detect small objects. It might fail to accurately detect objects in crowded scenes or when objects are far away from the camera.
• YOLO v7 is also not perfect at detecting objects at different scales. This
can make it difficult to detect objects that are either very large or very
small compared to the other objects in the scene.
• YOLO v7 can be sensitive to changes in lighting or other
environmental conditions, so it may be inconvenient to use in
real-world applications where lighting conditions may vary.
• YOLO v7 can be computationally intensive, which can make it difficult to run in real time on resource-constrained devices like smartphones.

CHAPTER 5

TESTING

Testing for object detection systems using deep learning from videos is a
crucial phase in assessing the performance and reliability of such systems
before deployment in real-world scenarios. This testing process involves
evaluating the model's ability to accurately detect and classify objects within
video frames, ensuring robustness and effectiveness across diverse conditions
and scenarios. By subjecting the object detection system to rigorous testing,
developers can identify potential weaknesses, optimize model parameters, and
enhance overall performance.

Testing encompasses various aspects, including dataset preparation,


evaluation metrics, and validation procedures, all aimed at validating the
system's efficacy and ensuring its suitability for practical applications. Through
comprehensive testing, object detection systems can be refined and validated,
paving the way for their deployment in critical domains such as surveillance,
video analytics, and autonomous systems. This introduction sets the stage for
understanding the importance and objectives of testing in object detection
systems using deep learning from videos.

5.1 TEST CASES

Our work is divided into five primary sections for testing, shown in Table 5.1. Loading the dataset, frame segmentation, image pre-processing, feature extraction, and classification are the five modules. The test cases for the
modules and submodules have been checked and passed. The test case id, test
case scenario, test case secondary considerations, and test case state are all
considered when testing our work. Test cases are validated, and the outcomes
and status for the scenarios are handled.

TC ID | Scenario | Secondary Consideration | State
TC01 | Loading Dataset | A huge amount of object video data is loaded for the object detection system | Pass
TC02 | Frame segmentation | Converting video data into frames of images | Pass
TC03 | Pre-processing | Pre-process the framed image data | Pass
TC05 | Feature Extraction | Check for features from the pre-processed data | Pass
TC06 | Classification | Check the classifier to detect the object | Pass

Table 5.1 Test cases

CHAPTER 6

FUTURE WORK

Some potential future directions for improving the object detection system
based on the YOLO V7 model:

Real-Time Performance Optimization: Further optimize the model for real-time performance across different hardware platforms to enable faster inference
speeds.

Multi-Object Tracking: Integrate multi-object tracking algorithms to detect


and track objects across consecutive frames, useful in surveillance and sports
analytics.

Improved Accuracy and Generalization: Continuously refine the model


architecture and training process to enhance detection accuracy and
generalization in diverse environments.

Semantic Segmentation Integration: Explore integrating semantic


segmentation techniques to improve object localization and scene
understanding.

Efficient Deployment: Develop strategies like model compression and


quantization for deployment on resource-constrained devices.

Domain Adaptation: Investigate techniques to adapt the model to specific


target domains for improved performance.

Uncertainty Estimation: Incorporate uncertainty estimation techniques to


provide confidence scores for detected objects, aiding decision-making.

Spatiotemporal Analysis: Extend the model to analyze spatiotemporal


information in videos for tasks like action recognition and anomaly detection.

APPENDIX A

SAMPLE SCREENSHOTS

The Object Detection and Tracking System UI presents an intuitive


interface for users interested in analyzing objects within videos. Users initiate
the analysis process by selecting the "detect" option, triggering the object
detection mechanism illustrated in Fig A.1 to A.7. Upon selection, users are prompted to upload a video for object
detection. Utilizing cutting-edge deep learning and machine learning algorithms,
the system proceeds to identify and visually display the detected objects within
the uploaded video. Once the objects are recognized, users are invited to input
specific inquiries related to the detected objects. These inquiries are then
processed by the system's advanced algorithms, incorporating sophisticated
natural language processing methodologies. Consequently, the system generates
tailored outputs in response to the inquiries, providing users with detailed
insights, tracking information, classifications, or any other relevant data
pertaining to the detected objects. Through this seamless process, the Object
Detection and Tracking System offers users a comprehensive and interactive
platform for exploring and understanding objects within videos.

Fig A.1 Landing Page

Fig A.2 Upload a video

Fig A.3 Model output download

OUTPUT

Fig A.4 Output 1

Fig A.5 Output 2

Fig A.6 Output 3

Fig A.7 Output 4

APPENDIX B
SAMPLE CODING
In this appendix, we outline the code snippets employed in our
development and evaluation of an AI-powered object detection system for video
analysis. Users are guided to input essential parameters and upload videos for
object detection. Noteworthy is the incorporation of a specialized deep learning
model that allows users to interact with the system's output, enabling detailed
analysis of detected objects. Moreover, the implementation includes robust
error-handling mechanisms to ensure a smooth user experience and precise
detection results, thereby improving the system's robustness and user
satisfaction.

Yolo.py:

import argparse

import logging

import sys

from copy import deepcopy

sys.path.append('./') # to run '$ python *.py' files in subdirectories

logger = logging.getLogger(__name__)

import torch

from models.common import *

from models.experimental import *

from utils.autoanchor import check_anchor_order

from utils.general import make_divisible, check_file, set_logging

from utils.torch_utils import time_synchronized, fuse_conv_and_bn,
model_info, scale_img, initialize_weights, \

select_device, copy_attr

from utils.loss import SigmoidBin

try:

import thop # for FLOPS computation

except ImportError:

thop = None

class Detect(nn.Module):

stride = None # strides computed during build

export = False # onnx export

end2end = False

include_nms = False

concat = False

def __init__(self, nc=80, anchors=(), ch=()):  # detection layer

super(Detect, self).__init__()

self.nc = nc # number of classes

self.no = nc + 5 # number of outputs per anchor

self.nl = len(anchors) # number of detection layers

self.na = len(anchors[0]) // 2 # number of anchors

self.grid = [torch.zeros(1)] * self.nl # init grid

a = torch.tensor(anchors).float().view(self.nl, -1, 2)
self.register_buffer('anchors', a) # shape(nl,na,2)
self.register_buffer('anchor_grid', a.clone().view(self.nl, 1, -1, 1, 1, 2))  # shape(nl,1,na,1,1,2)

self.m = nn.ModuleList(nn.Conv2d(x, self.no * self.na, 1) for x in ch)  # output conv

def forward(self, x):

# x = x.copy() # for profiling

z = [] # inference output

self.training |= self.export

for i in range(self.nl):

x[i] = self.m[i](x[i]) # conv

bs, _, ny, nx = x[i].shape # x(bs,255,20,20) to x(bs,3,20,20,85)

x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

if not self.training: # inference

if self.grid[i].shape[2:4] != x[i].shape[2:4]:

self.grid[i] = self._make_grid(nx, ny).to(x[i].device)

y = x[i].sigmoid()

if not torch.onnx.is_in_onnx_export():

y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i] # xy

include_nms = False

concat = False

def __init__(self, nc=80, anchors=(), ch=()):  # detection layer

super(IDetect, self).__init__()

self.nc = nc # number of classes

self.no = nc + 5 # number of outputs per anchor

self.nl = len(anchors) # number of detection layers

self.na = len(anchors[0]) // 2 # number of anchors

self.grid = [torch.zeros(1)] * self.nl # init grid

a = torch.tensor(anchors).float().view(self.nl, -1, 2)
self.register_buffer('anchors', a) # shape(nl,na,2)
self.register_buffer('anchor_grid', a.clone().view(self.nl, 1, -1, 1, 1, 2))  # shape(nl,1,na,1,1,2)

self.m = nn.ModuleList(nn.Conv2d(x, self.no * self.na, 1) for x in ch)  # output conv

self.ia = nn.ModuleList(ImplicitA(x) for x in ch)

self.im = nn.ModuleList(ImplicitM(self.no * self.na) for _ in ch)

def forward(self, x):

# x = x.copy() # for profiling

z = [] # inference output

self.training |= self.export

x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

if not self.training: # inference

if self.grid[i].shape[2:4] != x[i].shape[2:4]:

self.grid[i] = self._make_grid(nx, ny).to(x[i].device)

y = x[i].sigmoid()

if not torch.onnx.is_in_onnx_export():

y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i] # xy

y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i] # wh

else:

xy, wh, conf = y.split((2, 2, self.nc + 1), 4)  # y.tensor_split((2, 4, 5), 4) # torch 1.8.0

xy = xy * (2. * self.stride[i]) + (self.stride[i] * (self.grid[i] - 0.5))  # new xy

wh = wh ** 2 * (4 * self.anchor_grid[i].data) # new wh

y = torch.cat((xy, wh, conf), 4)

z.append(y.view(bs, -1, self.no))

if self.training:

out = x

elif self.end2end:

out = torch.cat(z, 1)

args[1] = [list(range(args[1] * 2))] * len(f)

elif m is ReOrg:

c2 = ch[f] * 4

elif m is Contract:

c2 = ch[f] * args[0] ** 2

elif m is Expand:

c2 = ch[f] // args[0] ** 2

else:

c2 = ch[f]

m_ = nn.Sequential(*[m(*args) for _ in range(n)]) if n > 1 else m(*args)  # module

t = str(m)[8:-2].replace('__main__.', '')  # module type

np = sum([x.numel() for x in m_.parameters()]) # number params

m_.i, m_.f, m_.type, m_.np = i, f, t, np  # attach index, 'from' index, type, number params

logger.info('%3s%18s%3s%10.0f %-40s%-30s' % (i, f, n, np, t, args))  # print

save.extend(x % i for x in ([f] if isinstance(f, int) else f) if x != -1)  # append to savelist

layers.append(m_)

if i == 0:

ch = []

ch.append(c2)

return nn.Sequential(*layers), sorted(save)

if __name__ == '__main__':

parser = argparse.ArgumentParser()

parser.add_argument('--cfg', type=str, default='yolor-csp-c.yaml', help='model.yaml')

parser.add_argument('--device', default='', help='cuda device, i.e. 0 or 0,1,2,3 or cpu')

parser.add_argument('--profile', action='store_true', help='profile model speed')

opt = parser.parse_args()

opt.cfg = check_file(opt.cfg) # check file

set_logging()

device = select_device(opt.device)

# Create model

model = Model(opt.cfg).to(device)

model.train()

if opt.profile:

img = torch.rand(1, 3, 640, 640).to(device)

y = model(img, profile=True)

APPENDIX C

SYSTEM REQUIREMENTS

In this appendix, the system requirements that are most important for our work are mentioned. The hardware specification, software specification, supported browsers, and the source of the object data we used are listed below.

HARDWARE SPECIFICATION

● Processor (CPU) with 8 GB RAM
● Internet connection
● Keyboard and mouse or some other compatible pointing device

SOFTWARE SPECIFICATION

● Windows 10 or Higher
● Visual Studio Code

BROWSERS

● Chrome
● Edge
● Mozilla Firefox
● Internet Explorer
● Safari

DATASETS

● COCO
