
Pattern Recognition Summary

Lecture 1
❖ Supervised Learning has training set, features (predictors, or input), and outcome
(response or output).
❖ Unsupervised Learning observes only the features and has no outcome. We need to
cluster data or organize it.
❖ A pattern is a set of objects, processes or events which consist of both deterministic
and stochastic components.
❖ Machine learning is a method of teaching computers to learn from data.
o Pattern recognition and machine learning fields can be used to create systems
that can automatically detect and respond to patterns in data.
❖ Machine Learning vs Pattern Recognition:

o Machine Learning: a method of teaching computers to learn from data; has its origins in Computer Science.
o Pattern Recognition: the process of identifying patterns in data; has its origins in Engineering.

❖ Detection vs Description:

o Detection: something happened. Examples: heard a noise, saw something interesting, non-flat signals.
o Description: what has happened? Examples: gunshot, talking, laughing, crying, etc.
❖ Features: The intrinsic traits or characteristics that tell one pattern (object) apart from
another
o Allow focusing on the relevant, distinguishing parts of a pattern, and enable data reduction and abstraction.
❖ Importance of Features:
o Cannot be overstated.
o We usually don’t know which to select, what they represent, and how to tune
them.
o Classification and regression schemes are mostly trying to make the best of
whatever features are available.
o One feature is usually not descriptive.
▪ Lack of good features may be due to relevance issues, missing values, dimensionality, and time- and space-varying characteristics.
❖ We can decide if a feature is effective through a training phase.
❖ Feature space: D-dimensional (D being the number of features), populated with features from training samples.
❖ Decision boundary methods
o Learn the separation in the feature space.
o Examples: Cluster Centers, Decision Surfaces
❖ Parametric methods:
o Based on class sample exhibiting a certain parametric distribution.
o Learn the parameters through training.
o Example: Gaussian.
❖ Density methods:
o Does not enforce a parametric form.
o Learn the density function directly.
❖ A deterministic model is a model that assumes that the outcome of a system or process
is fully determined by its initial conditions and parameters
o does not involve any randomness or uncertainty.
o always produces the same result for the same input.
o can be useful when the system or process is well-understood, predictable, and
stable, and when the accuracy and precision of the model are important.
o Example: Crystal Structure.
❖ A stochastic model is a model that incorporates some elements of randomness or
uncertainty into the system or process.
o does not assume that the outcome of a system or process is fully determined
by its initial conditions and parameters.
o It can vary according to some probability distribution or function.
o can be useful when the system or process is complex, dynamic, and
unpredictable, and when the variability and distribution of the model are
important.
o Example: White Noise.
❖ Statistical Tests:
o t-test: Tests for the difference between the means of two independent groups.
o ANOVA: Tests for the difference between the means of three or more groups.
o F-test: Compares the variances of two groups.
o Chi-square test: Tests for relationships between categorical variables.
o Correlation analysis: Measures the strength and direction of the linear
relationship between two continuous variables.
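
A minimal sketch of the tests listed above using SciPy (assumed installed) on synthetic data; the group values and contingency table below are made up for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=5.0, scale=1.0, size=30)   # three synthetic groups
b = rng.normal(loc=5.5, scale=1.0, size=30)
c = rng.normal(loc=6.0, scale=1.0, size=30)

t_stat, t_p = stats.ttest_ind(a, b)           # t-test: two independent means
f_stat, f_p = stats.f_oneway(a, b, c)         # one-way ANOVA: three group means
chi2, chi_p, dof, _ = stats.chi2_contingency([[10, 20], [20, 10]])  # chi-square
r, r_p = stats.pearsonr(a, b)                 # correlation between two variables

print(t_p, f_p, chi_p, r)
```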
❖ Machine Learning Models:
o Linear regression: Predicts a continuous outcome based on a linear relationship
with one or more independent variables.
o Logistic regression: Predicts a binary outcome based on a set of independent
variables.
o Naive Bayes: Classifies data points based on Bayes’ theorem and assuming
independence between features.
o Hidden Markov Models: Models sequential data with hidden states and
observable outputs.
❖ Density-based clustering algorithms: can deal with non-hyperspherical clusters and are robust to outliers.
❖ Traditional vs Modern Pattern Recognition:

o Traditional: hand-crafted features; simple, low-level concatenation of numbers or traits; syntactic; feature detection and description are separate tasks from classifier design.
o Modern: automatically learned features; hierarchical and complex; semantic; feature detection and description are jointly optimized with the classifier.

❖ Error rate refers to a measure of the degree of prediction error of a model made with
respect to the true model
❖ Two routes of Bayes Rule:
o Forward (synthesis) route: From class to sample in a class
o Backward (analysis) route: From sample to class ID (always harder).
❖ In Bayes' rule, we turn a backward (analysis) problem into several forward (synthesis) problems, an approach also known as analysis-by-synthesis.
❖ Types of errors:
o True positive the model correctly predicts the positive class.
o True negative the model correctly predicts the negative class.
o False positive the model incorrectly predicts the positive class.
o False negative the model incorrectly predicts the negative class.
❖ Various ways to measure error rate:
o Training & Testing error (under your control)
o Empirical error (generalization Error)
❖ Precision vs Recall: they typically trade off; pushing one up tends to push the other down.
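
A small illustration of the error types and the precision/recall metrics using scikit-learn (assumed installed); the labels below are hypothetical:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # hypothetical ground truth
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]   # hypothetical model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP", tp, "TN", tn, "FP", fp, "FN", fn)
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
```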
❖ Mean vs Median:

o Mean: the traditional measure of center; computed by summing the values and dividing by the number of values; easier to compute; prone to noise (outliers).
o Median: a resistant measure of the data's center; at least half of the ordered values are less than or equal to the median, and at least half are greater than or equal to it; if n is odd, the median is the middle ordered value; if n is even, it is the average of the two middle ordered values; finding the median in higher dimensions is much more complex.

❖ The mean and median of data from a symmetric distribution should be close together.
❖ Spread (Variability): exists when some values are different from (above or below) the
mean.
❖ Quartiles: Three numbers which divide the ordered data into four equal sized groups.
❖ Variance vs Covariance:

o Variance: the average squared deviation from the mean of a set of data; a measure of the deviation from the mean for points in one dimension; used to find the standard deviation.
o Covariance: a measure of how much each of the dimensions varies from the mean with respect to the others; measured between two dimensions; shows whether there is a relation between two dimensions.

❖ Correlation is another way to determine how two variables are related.


❖ The covariance of a dimension with itself is its variance.
❖ Covariance Types:
o Positive Covariance: Both dimensions increase or decrease together.
o Negative Covariance: as one dimension increases, the other decreases.
❖ In addition to whether variables are positively or negatively related, correlation also tells the degree to which the variables are related to each other.
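
A short NumPy illustration of variance, covariance, and correlation (the numbers are made up):

```python
import numpy as np

x = np.array([2.1, 2.5, 3.6, 4.0, 4.4])
y = np.array([8.0, 10.0, 12.0, 14.0, 16.0])

cov_matrix = np.cov(x, y)          # 2x2 matrix; the diagonal holds the variances
var_x = cov_matrix[0, 0]           # covariance of x with itself = variance of x
cov_xy = cov_matrix[0, 1]          # positive here: x and y increase together
corr_xy = np.corrcoef(x, y)[0, 1]  # correlation: covariance scaled to [-1, 1]

print(var_x, cov_xy, corr_xy)
```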
Lecture 2
❖ Dimensionality Reduction: a way to simplify complex, high-dimensional data by summarizing it with a lower-dimensional real-valued vector.
❖ Dimensionality Reduction Solutions:
o Multi-Dimensional Scaling:
▪ Preserve distance measures.
▪ Find projection that best preserves inter-point distances.
o Principal Component Analysis (PCA)
▪ best data representation (not necessarily best separation)
▪ Find projections that maximize the variance.
o ICA (Independent Component Analysis):
▪ Very similar to PCA except that it assumes non-Gaussian features
o Fisher’s Linear Discriminant:
▪ Preserve class separation (special case of PCA)
▪ Maximizing the component axes for class-separation
❖ Feature vectors represent the features used by machine learning models as multi-dimensional numerical values.
o Other Definition: A feature vector is an ordered list of numerical properties of
observed phenomena.
o As machine learning models can only deal with numerical values, converting any
necessary features into feature vectors is crucial.
❖ Quantitative Data vs Qualitative Data:

Qualitative Quantitative
Categorical Numerical
Humans can analyze qualitative data to machine learning models can only deal
make a decision with quantitative data

Examples: Examples:
- Gender - Age
- Religion - Height
- Marital status - Weight
- Qualifications - Income
❖ The linear model is one of the simplest models in machine learning. It assumes that the
data is linearly separable and tries to learn the weight of each feature.
o We can view linear classification models in terms of dimensionality reduction.
❖ Intuitively, good features are those with large separation of means relative to
variances.
❖ Fisher’s Linear Discriminant:
o Selects a projection that maximizes the class separation. To do that, it maximizes the ratio of the between-class variance to the within-class variance.
o to project the data to a smaller dimension and to avoid class overlapping, it
maintains two properties:
▪ A large variance among the dataset classes, so that the projected class
averages should be as far apart as possible.
▪ A small variance within each of the dataset classes, so that a small within-
class variance has the effect of keeping the projected data points closer
to one another.
▪ To find a projection with these properties, it learns a weight vector.
▪ Can be used as a supervised learning classifier.
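
A minimal sketch of a two-class Fisher discriminant on synthetic Gaussian data, assuming the standard weight-vector form w ∝ Sw⁻¹(m₂ − m₁); this is an illustration, not the lecture's exact derivation:

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal([0, 0], 1.0, size=(50, 2))   # class 1 samples (2-D)
X2 = rng.normal([3, 2], 1.0, size=(50, 2))   # class 2 samples (2-D)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class scatter: sum of the per-class scatter matrices.
Sw = np.cov(X1.T) * (len(X1) - 1) + np.cov(X2.T) * (len(X2) - 1)
w = np.linalg.solve(Sw, m2 - m1)             # projection direction

z1, z2 = X1 @ w, X2 @ w                      # 1-D projections of each class
print(z1.mean(), z2.mean())                  # projected class means are well separated
```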
Lecture 3
❖ Clustering: One way to summarize a complex real-valued data point with a single
categorical variable.
❖ Principal Component Analysis (PCA): An exploratory technique used to reduce the
dimensionality of the data set to 2D or 3D, can be used to:
o Reduce the number of dimensions in data.
o Find patterns in high-dimensional data.
o Visualize data of high dimensionality.
o Examples:
▪ Face recognition
▪ Image compression
▪ Gene expression analysis
❖ PCA steps to reduce dimensionality to 𝑟-dim:
o Compute Mean Vector 𝜇 and covariance matrix ∑ of original points.
o Compute eigenvectors and eigenvalues of ∑.
o Select top 𝑟 eigenvectors.
o Project points into the subspace spanned by them: 𝑦 = 𝐴(𝑥 − 𝜇), where 𝑦 is the new point, 𝑥 is the old one, and the rows of 𝐴 are the selected eigenvectors.
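
A direct NumPy sketch of the PCA steps above (synthetic data; r = 2 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))            # 100 points, D = 5 features

mu = X.mean(axis=0)                      # 1. mean vector
Sigma = np.cov(X, rowvar=False)          # 1. covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma) # 2. eigenvalues/eigenvectors (symmetric matrix)

order = np.argsort(eigvals)[::-1]        # 3. sort eigenvalues descending
r = 2
A = eigvecs[:, order[:r]].T              # 3. top-r eigenvectors as rows of A

Y = (X - mu) @ A.T                       # 4. project all points: y = A (x - mu)
print(Y.shape)                           # (100, 2)
```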
❖ Eigenvectors (𝑥) vs Eigenvalues (𝜆):
o Eigenvector: a vector that is only stretched by a transformation, with no rotation or shear; it does not change direction under the transformation; also called a characteristic vector; the zero vector cannot be an eigenvector; defined by 𝐴𝑥 = 𝜆𝑥.
o Eigenvalue: the factor by which an eigenvector is stretched or squished; zero can be an eigenvalue; also called a characteristic value; 1 means no change, 2 means doubling in length, −1 means pointing backwards; found by solving det(𝐴 − 𝜆𝐼) = 0.
o 𝐴 is a square matrix and 𝐼 is the identity matrix.

❖ A vector 𝑥 can be an eigenvector of 𝐴 for eigenvalue 𝜆 if and only if 𝐵 = 𝐴 − 𝜆𝐼 does not have an inverse, or equivalently det(𝐵) = 0.
❖ We say that 2 vectors are orthogonal if they are perpendicular to each other (the dot
product of the two vectors is zero).
❖ Bases of a vector space: a set of vectors in that space that can be used as coordinates
for it. The set must:
o span the vector space.
o be linearly independent.
❖ Principal Component 1 (PC1):
o The eigenvalue with the largest absolute value will indicate that the data have
the largest variance along its eigenvector, the direction along which there is
greatest variation.
o only a few directions manage to capture most of the variability in the data.
❖ PCA Disadvantages:
o While PCA simplifies the data and removes noise, it always leads to some loss of
information when we reduce dimensions.
o PCA is a linear dimensionality reduction technique, but not all real-world datasets are linearly structured.
Lecture 4
❖ Parameter estimation is defined as the experimental determination of values of
parameters that govern the system behavior, assuming that the structure of the
process is known.
❖ Discrete distribution: data can only take on certain values.
o probabilities can be assigned to the values in the distribution.
❖ Continuous Distribution: data can take on any value within a specified range
o normally described in terms of probability density, which can be converted into
the probability that a value will fall within a certain range.
❖ Parameter Estimation Approaches:
o Parametric:
▪ Algorithms that simplify the function to a known form.
▪ Assume a certain parametric form and estimate the parameters.
▪ A learning model that summarizes data with a set of parameters of fixed
size is called a parametric model.
▪ Advantages: Simpler, speed, less data.
▪ Disadvantages: constrained, limited complexity, and poor fit.
o Nonparametric:
▪ Algorithms that do not make strong assumptions about the form of the
mapping function.
▪ good when you have a lot of data and no prior knowledge, and when you
don’t want to worry too much about choosing just the right features.
▪ does not assume a parametric form and estimate the density profile
directly.
▪ Example: K-NN, Decision Trees, SVM
▪ Advantages: Flexibility, Power, Performance.
▪ Disadvantages: More data, slower, and overfitting.
o Boundary: estimate the separation hyperplane (hypersurface) between both.
❖ Maximum Likelihood Estimator:
o Batch estimator.
o Parameters have fixed but unknown values.
o The maximum likelihood estimator of the mean is the sample mean; that is, the estimate of 𝜇 is the average value of all the data points.
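
A quick numerical check that the ML estimate of a Gaussian mean is the sample mean (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(loc=4.2, scale=1.5, size=1000)  # samples from N(4.2, 1.5^2)
mu_hat = data.mean()                              # MLE of mu: average of all points
print(mu_hat)                                     # close to 4.2
```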
❖ Bayesian estimator:
o parameters as random variables with a prior distribution.
o Allows us to change the a priori distribution by incorporating measurements to
sharpen the profile.
❖ Given the numbers of occurrences: if the number of samples is large enough, the selection process is assumed to be unbiased.
o Caveat: the sampling itself may still be biased.
❖ Maximum A Posteriori (MAP): Like MLE with one additional twist:
o MLE implicitly assumes a uniform prior; MAP does not necessarily.
❖ MLE vs Bayesian Estimator:

MLE Bayesian
Faster (differentiation) Slow (integration)
Single model Multiple weighted
Known model Unknown model fine
Less information More information (nonuniform prior)

❖ Bayesian classifier and MAP will in general give different results when used to classify
new samples.
❖ Bayesian classifier is optimal, but can be very expensive, especially when many
hypotheses are kept and evaluated.
❖ Gibbs: randomly pick one hypothesis according to the current posterior.
Lecture 5
❖ Supervised Learning:
o Discover patterns in the data with known target (class) or label.
o These patterns are then utilized to predict the values of the target attribute in
future data instances.
❖ Unsupervised Learning:
o The data have no target attribute.
❖ Clustering: the task of grouping a set of data points such that data points in the same group are more similar to each other than to those in other groups; each group is known as a cluster.
o A cluster is represented by a single point, known as centroid.
▪ Centroid is computed as the means of all data points in a cluster.
▪ Cluster boundary is decided by the farthest data point in the cluster.
o The goals of clustering:
▪ Group data that are close (or similar) to each other.
▪ Identify such groupings (or clusters) in an unsupervised manner.
❖ Clustering Types:
o Exclusive Clustering: K-Means (a minimal code sketch appears after this list).
▪ Basic Idea: randomly initialize the k cluster centers, then assign each point to the cluster whose center is closest to it.
▪ Properties: always converges to some solution, which can be a “local minimum”.
▪ Cons: sensitive to initial centers and outliers, and assumes that means can be computed.
o Overlapping Clustering: Fuzzy C-Means.
▪ Each data point can belong to several clusters and is assigned a probability (membership) score for each cluster.
▪ Pros:
• Allows a data point to be in multiple clusters.
• gives better results for overlapped data sets compared to k-
means clustering.
▪ Cons:
• Need to define the number of clusters.
• Sensitive to initial assignment of centroids. (not deterministic)
o Hierarchical Clustering: Agglomerative Clustering, Divisive Clustering.
▪ Produces a nested sequence of clusters, a tree, also called a dendrogram.
▪ Agglomerative (bottom-up) “more popular”: builds the dendrogram
(tree) from the bottom level, merges the most similar (or nearest) pair
of clusters, and stops when all the data points are merged into the root
cluster.
▪ Divisive (top-down): starts with all data points in one cluster, the root, splits the root into a set of child clusters, recursively divides each child cluster further, and stops when every cluster contains only a single point.
▪ Pros:
• Dendrograms are great for visualization
• Provides hierarchical relations between clusters
• Shown to be able to capture concentric clusters
▪ Cons:
• Not easy to define levels for clusters.
• other clustering techniques outperform hierarchical clustering
o Probabilistic Clustering: Mixture of Gaussian Models.
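
The K-Means sketch referenced above, using scikit-learn (assumed installed) on synthetic blobs; k is chosen by hand here, and results depend on the initialization:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),        # three synthetic blobs
               rng.normal(4, 0.5, (50, 2)),
               rng.normal([0, 4], 0.5, (50, 2))])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # centroids: means of the points in each cluster
print(km.labels_[:10])       # hard (exclusive) assignment of each point to one cluster
```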
❖ Hard clustering vs. Soft Clustering:

o Hard Clustering: each data point is clustered or grouped into exactly one cluster. Example: K-Means.
o Soft Clustering: each data point can belong to multiple clusters, along with a probability score or likelihood for each. Example: Fuzzy C-Means.

❖ Problem with Euclidean distance in high dimensions: Euclidean distance loses pretty much all of its meaning.
❖ Binary attribute: an attribute that has two values or states but no ordering
relationships.
❖ We use a confusion matrix to introduce the distance functions / measures.
❖ Clustering Criteria:
o Similarity Function: use an appropriate distance function.
o Stopping Criteria:
▪ No (or minimum) re-assignments of data points to different clusters.
▪ No (or minimum) change of centroids.
▪ Minimum decrease in the sum of squared error.
o Cluster Quality
▪ Intra-cluster cohesion (compactness)
• measures how near the data points in a cluster are to the cluster
centroid.
• Sum of squared error (SSE) is a commonly used measure.
▪ Inter-cluster separation (isolation)
• different cluster centroids should be far away from one another.
❖ Normalization: technique to force the attributes to have a common value range
o Two main approaches to standardize interval scaled attributes, range and z-
score.
❖ Z-score: transforms the attribute values so that they have a mean of zero and a mean
absolute deviation of 1.
❖ Clustering evaluation measures are Entropy and Purity.
o Entropy: measures the uncertainty of a random variable, it characterizes the
impurity of an arbitrary collection of examples.
▪ The higher the entropy, the more the information content.
▪ If the entropy is 0, then the outcome is “certain”
▪ If the entropy is maximum, then any outcome is equally possible.
o Purity: measures the extent that a cluster contains only one class of data.
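
A small sketch of entropy and purity computed from the class counts inside a cluster (the counts are illustrative):

```python
import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()      # 0 when only one class, maximal when uniform

def purity(counts):
    counts = np.asarray(counts, dtype=float)
    return counts.max() / counts.sum()  # fraction of the dominant class in the cluster

print(entropy([10, 0, 0]), purity([10, 0, 0]))  # 0.0, 1.0  -> pure cluster, certain outcome
print(entropy([5, 5]), purity([5, 5]))          # 1.0, 0.5  -> maximally mixed cluster
```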
Lecture 6
❖ Feature Selection (FS): process of choosing an optimal subset of features according to
a certain criterion.
❖ The selection can be represented as a binary array
o value 1 if the feature is currently selected by the algorithm and
o value 0 if the feature does not occur.
o There is a total of 2^𝑀 subsets, where 𝑀 is the number of features of a data set.
❖ Importance of Feature Selection:
o Improve performance
o Visualize the data for model selection
o Reduce dimensionality and remove noise
❖ Feature Selection Search Directions:
o Sequential Forward Generation (SFG):
▪ Starts with an empty set of features
▪ Features are added into the set according to some criterion that
distinguish the best feature from the others
▪ The stopping criteria:
• A threshold for the number of relevant features 𝑚
• The generation of all possible subsets in brute force mode
o Sequential Backward Generation (SBG):
▪ Starts with a full set of features.
▪ iteratively, features are removed one at a time.
▪ The criterion must point out the worst or least important feature.
▪ The stopping Criteria: when the subset is only composed of a unique
feature, which is the most informative of the whole set.
o Bidirectional Generation (BG):
▪ Starts in both directions, performing SFG and SBG concurrently.
▪ The stopping criteria:
• One search finds the best subset comprised of m features before
it reaches the exact middle.
• Both searches reach the middle of the search space.
o Random Generation (RG):
▪ Starts the search in a random direction
▪ Does not follow a fixed way for subset generation.
▪ Unlike SFG or SBG, the size of the subset of features cannot be specified
▪ The choice of adding or removing a feature is a random decision
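
A hedged sketch of the sequential forward/backward generation idea above, using scikit-learn's SequentialFeatureSelector (assumed available, version 0.24 or later) with a k-NN learner as the evaluation criterion:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Forward: start from an empty set and add features one at a time.
sfg = SequentialFeatureSelector(knn, n_features_to_select=2, direction="forward").fit(X, y)
# Backward: start from the full set and remove features one at a time.
sbg = SequentialFeatureSelector(knn, n_features_to_select=2, direction="backward").fit(X, y)

print("forward kept :", sfg.get_support())   # boolean mask over the 4 iris features
print("backward kept:", sbg.get_support())
```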
❖ Feature Selection Search Strategies:
o Exhaustive Search:
▪ Explore all possible subsets to find the optimal ones.
▪ Space complexity is 𝑂(2^𝑀).
o Heuristic Search:
▪ Employs heuristics to carry out the search.
▪ Prevents brute force search.
▪ Is not guaranteed to find the optimal subset of features.
▪ The number of subsets generated is 𝑂(𝑀).
o Nondeterministic Search (random search strategy):
▪ Complementary combination of Exhaustive and Heuristic Searches.
▪ Generate the best subsets constantly and keep improving the quality of
selected features as time goes by.
❖ Brute force algorithm: simple and straightforward approach to solve a problem by
trying every possible solution until finding the best one
o Advantage: easy to understand and implement.
o Disadvantage: very inefficient and time-consuming.
❖ Feature Selection Measures:
o Information Measures: measure the uncertainty of the receiver.
▪ Examples: Shannon’s Entropy, Information Gain.
o Distance Measures: Measures of separability, discrimination or divergence.
▪ The most typical is derived from distance between the class conditional
density functions.
o Dependence Measures: measures of association or correlation.
▪ Quantify how strongly two variables are correlated or present some
association with each other
o Consistency Measures: attempt to find a minimum number of features that separate the classes as consistently as the full set of features can.
▪ They aim to achieve 𝑃(𝐶|𝐹𝑢𝑙𝑙𝑆𝑒𝑡) = 𝑃(𝐶|𝑆𝑢𝑏𝑆𝑒𝑡).
▪ Inconsistency: two examples with the same inputs (same feature values)
but with different output feature values (classes in classification).
o Accuracy Measures: relies on the classifier or learner.
❖ Filters: select features by measuring uncertainty, distances, dependence, or consistency; usually fast.
o Does not rely on a particular learning bias.
o Can handle larger sized data.
❖ Wrappers: use the learner itself to evaluate feature subsets, which can improve the learner’s predictive performance.
o Allows a learning algorithm to fully exploit its bias, unlike filters.
❖ Embedded FS: like wrappers, but the features are selected during the learning process.
o Take advantage of the available data by not requiring splitting the training data
into a training and validation.
❖ Output of Feature Selection:
o Feature Ranking Techniques:
▪ The output is a ranked list of features which are ordered according to
evaluation measures.
▪ Returns the relevance of the features.
o Minimum Subset Techniques:
▪ The number of relevant features is a parameter that is often not known
by the practitioner.
▪ There must be a second category of techniques focused on obtaining
the minimum possible subset without ordering the features.
▪ Whatever is inside the subset is considered relevant; everything outside it is irrelevant.
❖ Feature Selection Evaluation:
o Inferability: improvement of the prediction of unseen examples with respect to
the direct usage of the raw training data
o Interpretability: since raw data is hard for humans to comprehend, data mining (DM) is also used to generate a more understandable structural representation that can explain the behavior of the data.
o Data Reduction: better and simpler to handle data with lower dimensions in
terms of efficiency and interpretability.
❖ Derived Feature Selection Evaluation Techniques:
o Accuracy
o Complexity
o Number of Features Selected
o Speed of the FS method
o Generality of the features selected
❖ Feature Selection Drawbacks:
o The resulting subsets are strongly dependent on the training set size.
o A backward removal strategy is very slow when working with large-scale data
sets.
o The outcome will still be left with a relatively large number of relevant features.
❖ Three major components to categorize Representative Methods combinations:
o Search Direction
o Search Strategy
o Evaluation Measure
Lecture 7
❖ Decision Tree: a graph that is used to represent choices and their results in the form
of a tree.
o The nodes in the graph represent an event or choice.
o The edges in the graph represent the decision rules or conditions.
o The tree is terminated by leaf nodes that represent the result of following a
combination of decisions.
o Mostly used in Machine Learning and Data Mining using Python.
o Built using recursive partitioning (divide-and-conquer):
▪ Uses the feature values to split the data into smaller subsets of similar
classes.
o Supervised learning algorithm.
o Can be used for solving regression and classification problems.
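
A minimal decision-tree classification sketch with scikit-learn (assumed installed), using entropy as the splitting criterion:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Recursive partitioning: each internal node splits on the feature/value
# that best separates the classes (here judged by entropy / information gain).
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X_tr, y_tr)
print("test accuracy:", tree.score(X_te, y_te))
```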
❖ Greedy Algorithm: always makes the choice that seems to be the best at the moment.
❖ Algorithms used in Decision Trees:
o ID3 (Iterative Dichotomiser 3): builds decision trees using a top-down greedy search through the space of possible branches with no backtracking.
▪ It begins with the original set S as the root node.
▪ On each iteration of the algorithm, it iterates through every unused attribute of the set S and calculates the Entropy (H) and Information Gain (IG) of that attribute.
▪ It then selects the attribute which has the smallest entropy or largest information gain.
▪ The set S is then split by the selected attribute to produce a subset of
the data.
▪ The algorithm continues to recur on each subset, considering only
attributes never selected before.
❖ The Decision Tree Algorithms strengths and weaknesses:

Strengths Weaknesses
More efficient than other complex models Easy to overfit or underfit the model
Can be used on data with relatively few Small changes in training data can result
training examples or a very large number. in large changes to decision logic.

Can handle numeric, nominal features, or Often biased towards splits on features
missing data. having a large number of levels.

❖ Gini Index: a method used to select which of the n attributes of the dataset would be placed at the root or an internal node.
o It measures how often a randomly chosen element would be incorrectly
identified.
o An attribute with lower Gini index should be preferred.
❖ Entropy Formula:

−(𝑝+ log₂(𝑝+)) − (𝑝− log₂(𝑝−))

o 𝑝+ is the probability of positive examples.
o 𝑝− is the probability of negative examples.
❖ Information Gain: measures how well a given attribute separates the training examples
according to their target classification.
o Used to select among the candidate attributes at each step while growing the
tree.
o Gain is a measure of how much we can reduce uncertainty (its value lies between 0 and 1).
o Information Gain Formula:
𝐸𝑛𝑡𝑟𝑜𝑝𝑦 (𝑝𝑎𝑟𝑒𝑛𝑡) − 𝑝+ ∗ 𝐸𝑛𝑡𝑟𝑜𝑝𝑦(+) − 𝑝− ∗ 𝐸𝑛𝑡𝑟𝑜𝑝𝑦(−)
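
A worked example of these formulas; the 9/5 parent counts and the 7/7 split below are illustrative only:

```python
import math

def entropy(pos, neg):
    """Binary entropy from positive/negative example counts."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

# Parent node with 9 positive / 5 negative examples, split into two children.
parent = entropy(9, 5)                       # ~0.940
left, right = entropy(6, 1), entropy(3, 4)   # children after the split (7 examples each)
gain = parent - (7 / 14) * left - (7 / 14) * right
print(round(parent, 3), round(gain, 3))      # gain ~0.151: uncertainty reduced by the split
```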
❖ CART in Decision Trees:
o Classification Trees: used to separate the dataset into classes belonging to the
response variable.
o Regression Trees: needed when the response variable is numeric or continuous.
❖ Ways to remove overfitting in Decision Trees:
o Pruning Decision Trees:
▪ Remove the decision nodes starting from the leaf node such that the
overall accuracy is not disturbed.
▪ This is done by separating out the actual training set into two sets:
training data set, D and validation data set, V.
▪ Prepare the decision tree using the separated training data set, D. Then continue trimming the tree to optimize the accuracy on the validation data set, V.
o Random Forest:
▪ Has two main concepts:
• A random sampling of training data set when building trees.
• Random subsets of features considered when splitting nodes.
▪ A technique known as bagging is used to create an ensemble of trees
where multiple training sets are generated with replacement.
• In the bagging technique, the data set is divided into N samples using randomized sampling with replacement. A model is then built on each sample using a single learning algorithm, and the models’ predictions are combined.
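
A hedged random-forest sketch with scikit-learn (assumed installed), showing the two concepts above: bootstrap-sampled trees (bagging) and random feature subsets per split:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=100,      # number of trees, each trained on a bootstrap sample
    max_features="sqrt",   # random subset of features considered at each split
    random_state=0,
)
print(cross_val_score(forest, X, y, cv=5).mean())  # averaged 5-fold accuracy
```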
Lecture 8
❖ Artificial Neural Network: a pool of simple processing units which communicate by
sending signals to each other over many weighted connections.
o Learns using trial-and-error:
▪ Trial: Processing an input to produce an output
• In terms of ANN: Compute the output function of a given input
▪ Evaluate: comparing the actual output with the expected output.
▪ Adjust: Adjust the weights
❖ Von Neumann vs. Biological Neural Systems:

o Processor: complex, high speed, few (Von Neumann) vs. simple, low speed, many (biological).
o Memory: separate from the processor, non-content addressable (Von Neumann) vs. integrated into the processor, content addressable (biological).
o Computing: centralized, sequential, stored programs (Von Neumann) vs. distributed, parallel, self-learning, fault tolerant (biological).

❖ ANN Components:
o Processing units (cells or neurons).
o A state of activation for every unit: which is equivalent to the output of the unit.
o Connections between the units: each connection is defined by a weight
o Propagation rule: determines the effective input of a unit from its external
inputs
o Activation function: which determines the new level of activation based on the
effective input and the current activation.
o External input for each unit.
o Learning Rule: a method for information gathering.
o Environment: providing input and error signals
❖ Layers in ANN:
o Input Layer
o Hidden Layer
o Output Layer

❖ Back Propagation: supervised learning that is used at each layer to minimize the error
between the layer’s response and the actual data.
o The error at each hidden layer is an average of the evaluated error.
o Hidden layer networks are trained this way
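
A small hidden-layer network trained with backpropagation via scikit-learn's MLPClassifier (assumed installed); the architecture and iteration count are arbitrary choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# One hidden layer of 8 units; weights are adjusted iteratively to reduce the
# error between the network output and the expected output (trial -> evaluate -> adjust).
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
net.fit(X_tr, y_tr)
print("test accuracy:", net.score(X_te, y_te))
```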
❖ ANN Design Issues:
o Transfer Functions: map inputs and the weights to produce output.
▪ Linear: The output is proportional to the total weighted input.
▪ Threshold: The output is set at one of two values, depending on whether
the total weighted input is greater than or less than some threshold
value.
▪ Non-Linear: The output varies continuously but not linearly as the input
changes.
o Error Estimation:
▪ Root Mean Square Error (RMSE): measures the differences between the values predicted by a model or an estimator and the actual values.
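
RMSE written out directly with NumPy (the values are made up):

```python
import numpy as np

actual = np.array([3.0, -0.5, 2.0, 7.0])
predicted = np.array([2.5, 0.0, 2.0, 8.0])

# Root of the mean squared difference between predictions and actual values.
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(rmse)   # ~0.612
```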
Lecture 9
❖ Biometrics: the measurement and analysis of unique physical or behavioral
characteristics of individuals for the purpose of identification or authentication.
❖ Face recognition: a biometric technology that involves identifying or verifying
individuals by analyzing and comparing patterns in their facial features.
❖ Difference between Identification and Verification:
o Verification: The system compares the given individual with who they say they
are and gives yes or no decision.
o Identification: The system compares the given individual to all other individuals in the database and gives a ranked list of matches.
❖ Face Recognition Workflow:

o Detection: involves locating facial regions using algorithms that can detect
facial landmarks.
o Feature Extraction: the system extracts distinctive features from each face.
o Matching: The extracted facial features are then compared against a database
of known faces.
o Recognition/Verification:
▪ In identification scenarios: the system attempts to match the detected
face against all faces in the database to find a potential match
▪ In verification scenarios: the system compares the detected face against
a single specific identity in the database
❖ Implementation of Face Recognition:
o Data Collection: Gather a dataset of facial images for both training and testing
purposes.
o Preprocessing: Clean and preprocess the facial images to improve the quality
and consistency of the data.
o Feature Extraction: extract relevant features from the preprocessed facial
images
▪ Examples: PCA, HOG, CNNs.
o Model Training: Train a machine learning or deep learning model using the
extracted features and corresponding labels
▪ Examples: SVM, KNN, and CNNs
o Validation and Testing: Evaluate the trained model on a separate validation
dataset to fine-tune hyperparameters and assess its performance.
▪ Examples: Accuracy, Precision, Recall, F1-Score.
o Deployment: involves integrating the model into an existing real-world system.
o Monitoring and Maintenance.
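
A hedged end-to-end sketch of this pipeline (PCA features plus an SVM matcher) using scikit-learn's LFW faces dataset, which is downloaded on first use; the component count and SVM settings are illustrative choices, not the lecture's:

```python
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

faces = fetch_lfw_people(min_faces_per_person=70)           # data collection
X_tr, X_te, y_tr, y_te = train_test_split(                  # train / test split
    faces.data, faces.target, test_size=0.25, random_state=0)

model = make_pipeline(
    PCA(n_components=100, whiten=True, random_state=0),     # feature extraction
    SVC(kernel="rbf", C=10, gamma="scale"),                 # matching / classification
)
model.fit(X_tr, y_tr)                                        # model training
print("identification accuracy:", model.score(X_te, y_te))  # validation / testing
```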
❖ Face Recognition Strengths and Weaknesses:
o Strengths:
▪ Convenient and User-friendly.
▪ High Accuracy.
▪ Supports a wide range of applications.
▪ Scalability: can be easily scaled to accommodate large databases.
o Weaknesses:
▪ Privacy Concerns: related to the collection and use of biometric data
▪ Bias and Accuracy Issues: may exhibit biases and inaccuracies related to attributes such as skin color or gender.
▪ Security Risks: vulnerable to security risks such as spoofing attacks.
▪ Legal and Ethical Concerns: including compliance with data protection
regulations.
▪ Reliance on Quality Data: depend heavily on the quality and diversity of
the training data.
❖ Face Recognition Applications:
o Access Control: Securing physical spaces or digital resources by granting access
based on facial recognition.
o Surveillance: Monitoring and identifying individuals in public spaces for security
or law enforcement purposes.
o Authentication: Verifying identities for user access to devices, applications, or
online services.
o Healthcare: Patient identification and monitoring in medical settings.
