Pattern Summary Final
Lecture 1
❖ Supervised Learning has training set, features (predictors, or input), and outcome
(response or output).
❖ Unsupervised Learning observes only the features and has no outcome. We need to
cluster data or organize it.
❖ A pattern is a set of objects, processes or events which consist of both deterministic
and stochastic components.
❖ Machine learning is a method of teaching computers to learn from data.
o Pattern recognition and machine learning fields can be used to create systems
that can automatically detect and respond to patterns in data.
❖ Machine Learning vs Pattern Recognition:
❖ Detection vs Description:
o Detection: something happened. Examples: heard noise, saw something interesting, non-flat signals.
o Description: what has happened? Examples: gun shot, talking, laughing, crying, etc.
❖ Features: The intrinsic traits or characteristics that tell one pattern (object) apart from
another
o Allows focus on the relevant, distinguishing parts of a pattern, as well as data reduction and abstraction.
❖ Importance of Features:
o Cannot be overstated.
o We usually don’t know which to select, what they represent, and how to tune
them.
o Classification and regression schemes are mostly trying to make the best of
whatever features are available.
o One feature is usually not descriptive.
▪ Lack of good features may be due to relevance issues, missing values, dimensionality, and time- and space-varying characteristics.
❖ We can decide if a feature is effective through a training phase.
❖ Feature space: a D-dimensional space (D = the number of features) populated with the feature vectors of the training samples.
❖ Decision boundary methods
o Learn the separation in the feature space.
o Examples: Cluster Centers, Decision Surfaces
❖ Parametric methods:
o Based on class sample exhibiting a certain parametric distribution.
o Learn the parameters through training.
o Example: Gaussian.
❖ Density methods:
o Does not enforce a parametric form.
o Learn the density function directly.
❖ A deterministic model is a model that assumes that the outcome of a system or process
is fully determined by its initial conditions and parameters
o does not involve any randomness or uncertainty.
o always produces the same result for the same input.
o can be useful when the system or process is well-understood, predictable, and
stable, and when the accuracy and precision of the model are important.
o Example: Crystal Structure.
❖ A stochastic model is a model that incorporates some elements of randomness or
uncertainty into the system or process.
o does not assume that the outcome of a system or process is fully determined
by its initial conditions and parameters.
o It can vary according to some probability distribution or function.
o can be useful when the system or process is complex, dynamic, and
unpredictable, and when the variability and distribution of the model are
important.
o Example: White Noise.
❖ Statistical Tests:
o t-test: Tests for the difference between the means of two independent groups.
o ANOVA: Tests for the difference between the means of three or more groups.
o F-test: Compares the variances of two groups.
o Chi-square test: Tests for relationships between categorical variables.
o Correlation analysis: Measures the strength and direction of the linear
relationship between two continuous variables.
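A minimal sketch of how these tests might be run in Python with scipy.stats; the arrays a, b, c and the contingency table are made-up illustration data, not part of the lecture.
```python
# Hypothetical illustration of the statistical tests listed above using scipy.stats.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 30)   # group 1 (made-up data)
b = rng.normal(0.5, 1.0, 30)   # group 2
c = rng.normal(1.0, 1.0, 30)   # group 3

t_stat, t_p = stats.ttest_ind(a, b)                  # t-test: two independent means
f_stat, f_p = stats.f_oneway(a, b, c)                # ANOVA: three or more means
table = np.array([[20, 10], [15, 25]])               # made-up counts of two categorical variables
chi2, chi_p, dof, _ = stats.chi2_contingency(table)  # chi-square test of independence
r, r_p = stats.pearsonr(a, b)                        # correlation analysis (linear)

print(t_p, f_p, chi_p, r)
```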
❖ Machine Learning Models:
o Linear regression: Predicts a continuous outcome based on a linear relationship
with one or more independent variables.
o Logistic regression: Predicts a binary outcome based on a set of independent
variables.
o Naive Bayes: Classifies data points based on Bayes’ theorem and assuming
independence between features.
o Hidden Markov Models: Models sequential data with hidden states and
observable outputs.
❖ Density-based clustering algorithms: can deal with non-hyperspherical clusters and are robust to outliers.
❖ Traditional vs Modern Pattern Recognition:
o Traditional: hand-crafted features; syntactic; feature detection and description are separate tasks from classifier design (not jointly optimized with the classifier).
o Modern: automatically learned features; semantic; feature detection, description, and classification are jointly optimized.
❖ Error rate refers to a measure of the degree of prediction error of a model made with
respect to the true model
❖ Two routes of Bayes Rule:
o Forward (synthesis) route: From class to sample in a class
o Backward (analysis) route: From sample to class ID (always harder).
❖ In Bayes' rule, we turn a backward (analysis) problem into several forward (synthesis) problems, an approach also known as analysis-by-synthesis.
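A tiny numeric sketch of this idea: the backward question P(class | x) is answered by combining the forward quantities P(x | class) and P(class) via Bayes' rule. The priors and likelihoods below are made-up numbers for illustration only.
```python
# Hypothetical two-class example of Bayes' rule: posterior ∝ likelihood × prior.
priors = {"A": 0.7, "B": 0.3}        # P(class), assumed known
likelihoods = {"A": 0.2, "B": 0.6}   # P(x | class) for one observed sample x (forward/synthesis)

evidence = sum(likelihoods[c] * priors[c] for c in priors)               # P(x)
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}  # backward/analysis

print(posteriors)   # {'A': ~0.4375, 'B': ~0.5625}
```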
❖ Types of errors (confusion matrix outcomes):
o True positive: the model correctly predicts the positive class.
o True negative: the model correctly predicts the negative class.
o False positive: the model incorrectly predicts the positive class.
o False negative: the model incorrectly predicts the negative class.
❖ Various ways to measure error rate:
o Training & Testing error (under your control)
o Empirical error (generalization Error)
❖ Precision vs Recall: there is a trade-off; tuning a classifier to increase one typically forces the other down.
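A short sketch computing the four outcome counts above and the resulting precision and recall; the label vectors are hypothetical.
```python
# Hypothetical computation of confusion-matrix counts, precision, and recall.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # made-up ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # made-up model predictions

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall    = tp / (tp + fn)   # of the actual positives, how many are found
print(tp, tn, fp, fn, precision, recall)
```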
❖ Mean vs Median:
o Mean:
▪ Traditional measure of center.
▪ Sum the values and divide by the number of values.
▪ Computation of the mean is easier.
▪ Prone to noise.
o Median:
▪ A resistant measure of the data's center.
▪ At least half of the ordered values are less than or equal to the median value.
▪ At least half of the ordered values are greater than or equal to the median value.
▪ If n is odd, the median is the middle ordered value.
▪ If n is even, the median is the average of the two middle ordered values.
❖ The mean and median of data from a symmetric distribution should be close together.
❖ Spread (Variability): exists when some values are different from (above or below) the
mean.
❖ Quartiles: Three numbers which divide the ordered data into four equal sized groups.
❖ Variance vs Covariance:
o Variance: the average squared deviation from the mean of a set of data; measures the deviation from the mean for points in one dimension; used to find the standard deviation.
o Covariance: a measure of how much each of the dimensions varies from the mean with respect to the others; measured between two dimensions; used to see if there is a relation between two dimensions.
❖ Qualitative vs Quantitative data:
o Qualitative (categorical): humans can analyze qualitative data to make a decision. Examples: gender, religion, marital status, qualifications.
o Quantitative (numerical): machine learning models can only deal with quantitative data. Examples: age, height, weight, income.
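A quick numpy illustration of variance vs covariance on two made-up dimensions x and y:
```python
# Hypothetical example: variance of one dimension vs covariance between two dimensions.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])    # dimension 1 (made-up values)
y = np.array([1.0, 3.0, 2.0, 5.0])    # dimension 2 (made-up values)

var_x = np.var(x, ddof=1)             # average squared deviation from the mean of x
cov_xy = np.cov(x, y, ddof=1)[0, 1]   # how x and y vary from their means together
std_x = np.sqrt(var_x)                # variance is used to find the standard deviation
print(var_x, cov_xy, std_x)
```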
❖ The linear model is one of the simplest models in machine learning. It assumes that the
data is linearly separable and tries to learn the weight of each feature.
o We can view linear classification models in terms of dimensionality reduction.
❖ Intuitively, good features are those with large separation of means relative to
variances.
❖ Fisher’s Linear Discriminant:
o Selects a projection that maximizes the class separation. To do that, it
maximizes the ratio between the between-class variance to the within-class
variance.
o to project the data to a smaller dimension and to avoid class overlapping, it
maintains two properties:
▪ A large variance among the dataset classes, so that the projected class
averages should be as far apart as possible.
▪ A small variance within each of the dataset classes, so that a small within-
class variance has the effect of keeping the projected data points closer
to one another.
▪ To find a projection with these two properties, it learns a weight vector (see the sketch below).
▪ Can be used as a supervised learning classifier.
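A minimal two-class sketch of Fisher's criterion on made-up data: the weight vector is taken as w ∝ S_W⁻¹(m₁ − m₀), which maximizes the between-class to within-class variance ratio. The data X0, X1 and the threshold step are assumptions for illustration.
```python
# Hypothetical sketch of Fisher's Linear Discriminant for two classes.
import numpy as np

rng = np.random.default_rng(1)
X0 = rng.normal([0, 0], 1.0, size=(50, 2))   # class 0 samples (made up)
X1 = rng.normal([3, 2], 1.0, size=(50, 2))   # class 1 samples (made up)

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)    # class means
# Within-class scatter: sum of the scatter matrices of each class.
S_w = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
# Fisher direction: maximizes between-class over within-class variance.
w = np.linalg.solve(S_w, m1 - m0)

# Project data onto w; a threshold on this 1-D projection acts as the classifier.
z0, z1 = X0 @ w, X1 @ w
print(z0.mean(), z1.mean())   # projected class means should be well separated
```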
Lecture 3
❖ Clustering: One way to summarize a complex real-valued data point with a single
categorical variable.
❖ Principal Component Analysis (PCA): An exploratory technique used to reduce the
dimensionality of the data set to 2D or 3D, can be used to:
o Reduce the number of dimensions in data.
o Find patterns in high-dimensional data.
o Visualize data of high dimensionality.
o Examples:
▪ Face recognition
▪ Image compression
▪ Gene expression analysis
❖ PCA steps to reduce dimensionality to 𝑟-dim:
o Compute Mean Vector 𝜇 and covariance matrix ∑ of original points.
o Compute eigenvectors and eigenvalues of ∑.
o Select top 𝑟 eigenvectors.
o Project points into the subspace spanned by them: 𝑦 = 𝐴(𝑥 − 𝜇), where 𝑦 is the new point, 𝑥 is the old one, and 𝐴 is the matrix whose rows are the top 𝑟 eigenvectors.
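A minimal numpy sketch of these PCA steps on made-up data X (rows are points); np.linalg.eigh is used since the covariance matrix Σ is symmetric.
```python
# Hypothetical implementation of the PCA steps above with plain numpy.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))          # made-up data: 100 points, 5 features
r = 2                                  # target dimensionality

mu = X.mean(axis=0)                    # step 1: mean vector
Sigma = np.cov(X, rowvar=False)        # step 1: covariance matrix
vals, vecs = np.linalg.eigh(Sigma)     # step 2: eigenvalues/eigenvectors (ascending order)
A = vecs[:, np.argsort(vals)[::-1][:r]].T   # step 3: top-r eigenvectors as rows of A
Y = (X - mu) @ A.T                     # step 4: y = A(x - mu) for every point
print(Y.shape)                         # (100, 2)
```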
❖ Eigenvectors (𝑥) vs Eigenvalues (𝜆):
o Eigenvectors: vectors that are only stretched by a transformation, with no rotation or shear; they do not change direction under the transformation; also called characteristic vectors; the zero vector cannot be an eigenvector; they satisfy 𝐴𝑥 = 𝜆𝑥.
o Eigenvalues: the factor by which an eigenvector is stretched or squished; the value zero can be an eigenvalue; also called characteristic values; 𝜆 = 1 means no change, 𝜆 = 2 means doubling in length, 𝜆 = −1 means pointing backwards; they are found from det(𝐴 − 𝜆𝐼) = 0.
o 𝐴 is a square matrix and 𝐼 is the identity matrix.
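A quick check of the defining relation 𝐴𝑥 = 𝜆𝑥 with numpy; the matrix A below is an arbitrary made-up example.
```python
# Hypothetical check that numpy's eigenpairs satisfy A @ x = lambda * x.
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])   # made-up square matrix
vals, vecs = np.linalg.eig(A)            # eigenvalues and eigenvectors (columns)
x, lam = vecs[:, 0], vals[0]
print(np.allclose(A @ x, lam * x))       # True: x is only stretched by the factor lam
```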
❖ MLE vs Bayesian:
o MLE: faster (differentiation); a single model; assumes a known model; uses less information.
o Bayesian: slower (integration); multiple weighted models; an unknown model is fine; uses more information (non-uniform prior).
❖ Bayesian classifier and MAP will in general give different results when used to classify
new samples.
❖ Bayesian classifier is optimal, but can be very expensive, especially when many
hypotheses are kept and evaluated.
❖ Gibbs: randomly pick one hypothesis according to the current posterior.
Lecture 5
❖ Supervised Learning:
o Discover patterns in the data with known target (class) or label.
o These patterns are then utilized to predict the values of the target attribute in
future data instances.
❖ Unsupervised Learning:
o The data have no target attribute.
❖ Clustering: the task of grouping a set of data points such that data points in the same group are more similar to each other than to those in other groups; each group is known as a cluster.
o A cluster is represented by a single point, known as centroid.
▪ Centroid is computed as the means of all data points in a cluster.
▪ Cluster boundary is decided by the farthest data point in the cluster.
o The goals of clustering:
▪ Group data that are close (or similar) to each other.
▪ Identify such groupings (or clusters) in an unsupervised manner.
❖ Clustering Types:
o Exclusive Clustering: K-Means.
▪ Basic Idea: randomly initialize the k cluster centers and assign each point to the cluster whose center is closest to it (a minimal sketch appears after this list).
▪ Properties: always converges to some solution, but the solution can be a local minimum.
▪ Cons: sensitive to initial centers and outliers, and assumes that means can be computed.
o Overlapping Clustering: Fuzzy C-Means.
▪ Each data point can belong to multiple clusters and is assigned a membership (probability) score for each cluster.
▪ Pros:
• Allows a data point to be in multiple clusters.
• gives better results for overlapped data sets compared to k-
means clustering.
▪ Cons:
• Need to define the number of clusters.
• Sensitive to initial assignment of centroids. (not deterministic)
o Hierarchical Clustering: Agglomerative Clustering, Divisive Clustering.
▪ Produces a nested sequence of clusters, a tree, also called dendrogram.
▪ Agglomerative (bottom-up) “more popular”: builds the dendrogram
(tree) from the bottom level, merges the most similar (or nearest) pair
of clusters, and stops when all the data points are merged into the root
cluster.
▪ Divisive (top-down): starts with all data points in one cluster (the root), splits the root into a set of child clusters, recursively divides each child cluster further, and stops when every cluster contains only a single point.
▪ Pros:
• Dendrograms are great for visualization
• Provides hierarchical relations between clusters
• Shown to be able to capture concentric clusters
▪ Cons:
• Not easy to define levels for clusters.
• other clustering techniques outperform hierarchical clustering
o Probabilistic Clustering: Mixture of Gaussian Models.
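The K-Means sketch referenced above: random initial centers, assign each point to the nearest center, recompute the means, and stop when the centroids no longer change. The two-blob data is made up; a real implementation would also guard against empty clusters.
```python
# Hypothetical bare-bones K-Means (exclusive clustering) on made-up 2-D data.
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # two blobs
k = 2
centers = X[rng.choice(len(X), k, replace=False)]   # random initial centers

for _ in range(20):                                  # usually converges quickly
    # Assign each point to the cluster with the closest center.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute each center as the mean of its assigned points.
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centers, centers):            # stopping criterion: no centroid change
        break
    centers = new_centers

print(centers)
```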
❖ Hard clustering vs. Soft Clustering: hard clustering assigns each data point to exactly one cluster (e.g., K-Means); soft clustering assigns each point a degree of membership in every cluster (e.g., Fuzzy C-Means, mixture models).
❖ Problem with Euclidean distance: at high dimensions, Euclidean distance loses pretty much all meaning.
❖ Binary attribute: an attribute that has two values or states but no ordering
relationships.
❖ We use a confusion matrix to introduce the distance functions / measures.
❖ Clustering Criteria:
o Similarity Function: use an appropriate distance function.
o Stopping Criteria:
▪ No (or minimum) re-assignments of data points to different clusters.
▪ No (or minimum) change of centroids.
▪ Minimum decrease in the sum of squared error.
o Cluster Quality
▪ Intra-cluster cohesion (compactness)
• measures how near the data points in a cluster are to the cluster
centroid.
• Sum of squared error (SSE) is a commonly used measure.
▪ Inter-cluster separation (isolation)
• different cluster centroids should be far away from one another.
❖ Normalization: technique to force the attributes to have a common value range
o Two main approaches to standardize interval scaled attributes, range and z-
score.
❖ Z-score: transforms the attribute values so that they have a mean of zero and a mean
absolute deviation of 1.
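A small sketch of the z-score variant described above (centering by the mean and scaling by the mean absolute deviation), on a made-up attribute column:
```python
# Hypothetical z-score standardization using the mean absolute deviation, as in the notes.
import numpy as np

x = np.array([10.0, 12.0, 9.0, 15.0, 30.0])   # made-up attribute values
mean = x.mean()
mad = np.mean(np.abs(x - mean))               # mean absolute deviation
z = (x - mean) / mad                          # standardized values
print(z, z.mean(), np.mean(np.abs(z)))        # mean ~0, mean absolute deviation ~1
```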
❖ Clustering evaluation measures are Entropy and Purity.
o Entropy: measures the uncertainty of a random variable, it characterizes the
impurity of an arbitrary collection of examples.
▪ The higher the entropy, the more the information content.
▪ If the entropy is 0, then the outcome is “certain”
▪ If the entropy is maximum, then any outcome is equally possible.
o Purity: measures the extent that a cluster contains only one class of data.
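A short sketch of how cluster entropy and purity might be computed for a single cluster, given the true class labels of its members; the labels below are made up.
```python
# Hypothetical entropy and purity of a single cluster from its members' true classes.
import numpy as np
from collections import Counter

cluster_labels = ["A", "A", "A", "B", "B", "C"]   # made-up true classes in one cluster
counts = np.array(list(Counter(cluster_labels).values()), dtype=float)
p = counts / counts.sum()                         # class proportions within the cluster

entropy = -np.sum(p * np.log2(p))                 # 0 when the cluster is pure (certain)
purity = p.max()                                  # fraction of the dominant class
print(entropy, purity)
```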
Lecture 6
❖ Feature Selection (FS): process of choosing an optimal subset of features according to
a certain criterion.
❖ The selection can be represented as a binary array:
o value 1 if the feature is currently selected by the algorithm, and
o value 0 if the feature is not selected.
o There are a total of 2^M possible subsets, where M is the number of features of a data set.
❖ Importance of Feature Selection:
o Improve performance
o Visualize the data for model selection
o Reduce dimensionality and remove noise
❖ Feature Selection Search Directions:
o Sequential Forward Generation (SFG):
▪ Starts with an empty set of features
▪ Features are added into the set according to some criterion that
distinguish the best feature from the others
▪ The stopping criteria:
• A threshold for the number of relevant features 𝑚
• The generation of all possible subsets in brute force mode
o Sequential Backward Generation (SBG):
▪ Starts with a full set of features.
▪ iteratively, features are removed one at a time.
▪ The criterion must point out the worst or least important feature.
▪ The stopping Criteria: when the subset is only composed of a unique
feature, which is the most informative of the whole set.
o Bidirectional Generation (BG):
▪ Starts in both directions, performing SFG and SBG concurrently.
▪ The stopping criteria:
• One search finds the best subset comprised of m features before
it reaches the exact middle.
• Both searches reach the middle of the search space.
o Random Generation (RG):
▪ Starts the search in a random direction
▪ Does not follow a fixed way for subset generation.
▪ Unlike SFG or SBG, the size of the subset of features cannot be specified
▪ The choice of adding or removing a feature is a random decision
❖ Feature Selection Search Strategies:
o Exhaustive Search:
▪ Explore all possible subsets to find the optimal ones.
▪ Space complexity is O(2^M).
o Heuristic Search:
▪ Employs heuristics to carry out the search.
▪ Prevents brute force search.
▪ May settle for a non-optimal subset of features (optimality is not guaranteed).
▪ The number of subsets generated is 𝑂(𝑀).
o Nondeterministic Search (random search strategy):
▪ Complementary combination of Exhaustive and Heuristic Searches.
▪ Generates candidate subsets continually and keeps improving the quality of the selected features as time goes by.
❖ Brute force algorithm: simple and straightforward approach to solve a problem by
trying every possible solution until finding the best one
o Advantage: easy to understand and implement.
o Disadvantage: very inefficient and time-consuming.
❖ Feature Selection Measures:
o Information Measures: measure the uncertainty of the receiver.
▪ Examples: Shannon’s Entropy, Information Gain.
o Distance Measures: Measures of separability, discrimination or divergence.
▪ The most typical is derived from distance between the class conditional
density functions.
o Dependence Measures: measures of association or correlation.
▪ Quantify how strongly two variables are correlated or present some
association with each other
o Consistency Measures: attempt to find a minimum number of features that can separate the classes as well as the full set of features can.
▪ They aim to achieve 𝑃(𝐶|𝐹𝑢𝑙𝑙𝑆𝑒𝑡) = 𝑃(𝐶|𝑆𝑢𝑏𝑆𝑒𝑡).
▪ Inconsistency: two examples with the same inputs (same feature values)
but with different output feature values (classes in classification).
o Accuracy Measures: relies on the classifier or learner.
❖ Filters: evaluate features by measuring uncertainty, distances, dependence, or consistency; usually fast.
o Does not rely on a particular learning bias.
o Can handle larger sized data.
❖ Wrappers: evaluate feature subsets with the learner itself, which can achieve the purpose of improving the learner’s predictive performance.
o Allows a learning algorithm to fully exploit its bias, unlike filters.
❖ Embedded FS: like wrappers, but the features are selected during the learning process.
o Take advantage of the available data by not requiring splitting the training data
into a training and validation.
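A hedged sketch combining two ideas above: Sequential Forward Generation as the search direction and a wrapper (classifier accuracy) as the evaluation measure. It assumes scikit-learn is available and uses the iris dataset and a logistic regression learner purely as stand-ins.
```python
# Hypothetical wrapper-style Sequential Forward Generation (greedy forward selection).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)        # stand-in dataset; replace with your own
m = 2                                    # stopping criterion: number of features to keep

selected = []                            # binary-array view: selected -> 1, others -> 0
remaining = list(range(X.shape[1]))
while len(selected) < m:
    # Evaluate each candidate feature added to the current subset (wrapper measure).
    scores = {
        f: cross_val_score(LogisticRegression(max_iter=1000),
                           X[:, selected + [f]], y, cv=5).mean()
        for f in remaining
    }
    best = max(scores, key=scores.get)   # criterion: the feature that helps accuracy most
    selected.append(best)
    remaining.remove(best)

print("selected feature indices:", selected)
```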
❖ Output of Feature Selection:
o Feature Ranking Techniques:
▪ The output is a ranked list of features which are ordered according to
evaluation measures.
▪ Returns the relevance of the features.
o Minimum Subset Techniques:
▪ The number of relevant features is a parameter that is often not known
by the practitioner.
▪ There must be a second category of techniques focused on obtaining
the minimum possible subset without ordering the features.
▪ Features in the returned subset are treated as relevant; everything outside it is treated as irrelevant (no ranking is provided).
❖ Feature Selection Evaluation:
o Inferability: improvement of the prediction of unseen examples with respect to
the direct usage of the raw training data
o Interpretability: since raw data is hard for humans to comprehend, data mining (DM) is also used to generate a more understandable structural representation that can explain the behavior of the data.
o Data Reduction: better and simpler to handle data with lower dimensions in
terms of efficiency and interpretability.
❖ Derived Feature Selection Evaluation Techniques:
o Accuracy
o Complexity
o Number of Features Selected
o Speed of the FS method
o Generality of the features selected
❖ Feature Selection Drawbacks:
o The resulting subsets are strongly dependent on the training set size.
o A backward removal strategy is very slow when working with large-scale data
sets.
o The outcome will still be left with a relatively large number of relevant features.
❖ Three major components to categorize Representative Methods combinations:
o Search Direction
o Search Strategy
o Evaluation Measure
Lecture 7
❖ Decision Tree: a graph that is used to represent choices and their results in the form
of a tree.
o The nodes in the graph represent an event or choice.
o The edges in the graph represent the decision rules or conditions.
o The tree is terminated by leaf nodes that represent the result of following a
combination of decisions.
o Mostly used in Machine Learning and Data Mining using Python.
o Built using recursive partitioning (divide-and-conquer):
▪ Uses the feature values to split the data into smaller subsets of similar
classes.
o Supervised learning algorithm.
o Can be used for solving regressions and classifications.
❖ Greedy Algorithm: always makes the choice that seems to be the best at the moment.
❖ Algorithms used in Decision Trees:
o ID3 (Iterative Dichotomiser 3): builds decision trees using a top-down greedy search
approach through the space of possible branches with no backtracking.
▪ It begins with the original set S as the root node.
▪ On each iteration of the algorithm, it iterates through every unused attribute of the set S and calculates the entropy (H) and information gain (IG) of that attribute.
▪ It then selects the attribute with the smallest entropy or, equivalently, the largest information gain.
▪ The set S is then split by the selected attribute to produce a subset of
the data.
▪ The algorithm continues to recur on each subset, considering only
attributes never selected before.
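A brief scikit-learn sketch of training an entropy-based decision tree. Note that scikit-learn's DecisionTreeClassifier implements a CART-style algorithm rather than ID3 itself, so this is only an approximation of the procedure described above; the iris dataset is a stand-in.
```python
# Hypothetical decision-tree training with an entropy (information-gain style) criterion.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)                    # stand-in dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy",   # split on information gain
                              max_depth=3,           # limit depth to reduce overfitting
                              random_state=0)
tree.fit(X_tr, y_tr)

print(tree.score(X_te, y_te))                        # test accuracy
print(export_text(tree))                             # nodes = choices, leaves = results
```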
❖ Decision tree algorithm strengths and weaknesses:
o Strengths:
▪ More efficient than other complex models.
▪ Can be used on data with relatively few training examples or a very large number.
▪ Can handle numeric features, nominal features, or missing data.
o Weaknesses:
▪ Easy to overfit or underfit the model.
▪ Small changes in training data can result in large changes to decision logic.
▪ Often biased towards splits on features having a large number of levels.
❖ Gini Index: a method used to select, from the n attributes of the dataset, which attribute should be placed at the root or at an internal node.
o It measures how often a randomly chosen element would be incorrectly
identified.
o An attribute with lower Gini index should be preferred.
❖ Entropy Formula: Entropy(S) = − Σᵢ pᵢ log₂(pᵢ), where pᵢ is the proportion of examples in S that belong to class i.
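A small sketch computing both impurity measures for a candidate node, using made-up class counts:
```python
# Hypothetical Gini index and entropy of one node from its class proportions.
import numpy as np

counts = np.array([8.0, 2.0])        # made-up class counts at a node
p = counts / counts.sum()

gini = 1.0 - np.sum(p ** 2)          # lower Gini -> purer node, preferred for splitting
entropy = -np.sum(p * np.log2(p))    # Entropy(S) = -sum_i p_i log2(p_i)
print(gini, entropy)                 # 0.32, ~0.72
```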
❖ ANN Components:
o Processing units (cells or neurons).
o A state of activation for every unit: which is equivalent to the output of the unit.
o Connections between the units: each connection is defined by a weight
o Propagation rule: determines the effective input of a unit from its external
inputs
o Activation function: which determines the new level of activation based on the
effective input and the current activation.
o External input for each unit.
o Learning Rule: a method for information gathering.
o Environment: providing input and error signals
❖ Layers in ANN:
o Input Layer
o Hidden Layer
o Output Layer
❖ Back Propagation: supervised learning that is used at each layer to minimize the error
between the layer’s response and the actual data.
o The error at each hidden layer is an average of the evaluated error.
o Hidden layer networks are trained this way
❖ ANN Design Issues:
o Transfer Functions: map inputs and the weights to produce output.
▪ Linear: The output is relative to the total weighted input.
▪ Threshold: The output is set at one of two values, depending on whether
the total weighted input is greater than or less than some threshold
value.
▪ Non-Linear: The output varies continuously but not linearly as the input
changes.
o Error Estimation:
▪ Root Mean Square Error (RMSE): measures the differences between the values predicted by a model or an estimator and the actual values.
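A two-line sketch of the RMSE computation on made-up predicted and actual values:
```python
# Hypothetical RMSE between made-up predicted and actual values.
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))   # root of the mean squared difference
print(rmse)
```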
Lecture 9
❖ Biometrics: the measurement and analysis of unique physical or behavioral
characteristics of individuals for the purpose of identification or authentication.
❖ Face recognition: a biometric technology that involves identifying or verifying
individuals by analyzing and comparing patterns in their facial features.
❖ Difference between Identification and Verification:
o Verification: the system compares the given individual with who they claim to be and gives a yes or no decision.
o Identification: the system compares the given individual to all other individuals in the database and gives a ranked list of matches.
❖ Face Recognition Workflow:
o Detection: involves locating facial regions using algorithms that can detect
facial landmarks.
o Feature Extraction: the system extracts distinctive features from each face.
o Matching: The extracted facial features are then compared against a database
of known faces.
o Recognition/Verification:
▪ In identification scenarios: the system attempts to match the detected
face against all faces in the database to find a potential match
▪ In verification scenarios: the system compares the detected face against
a single specific identity in the database
❖ Implementation of Face Recognition:
o Data Collection: Gather a dataset of facial images for both training and testing
purposes.
o Preprocessing: Clean and preprocess the facial images to improve the quality
and consistency of the data.
o Feature Extraction: extract relevant features from the preprocessed facial
images
▪ Examples: PCA, HOG, CNNs.
o Model Training: Train a machine learning or deep learning model using the
extracted features and corresponding labels
▪ Examples: SVM, KNN, and CNNs
o Validation and Testing: Evaluate the trained model on a separate validation
dataset to fine-tune hyperparameters and assess its performance.
▪ Examples: Accuracy, Precision, Recall, F1-Score.
o Deployment: involves integrating the model into an existing real-world system.
o Monitoring and Maintenance.
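A hedged end-to-end sketch of the implementation steps above (data collection, preprocessing, PCA feature extraction, SVM training, and evaluation), assuming scikit-learn is available and using its LFW faces dataset as a stand-in for real data collection; the PCA/SVM settings are illustrative choices, not prescribed by the lecture.
```python
# Hypothetical face-recognition pipeline: PCA ("eigenfaces") features + an SVM classifier.
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Data collection: LFW images of people with at least 70 photos (downloads on first use).
faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
X, y = faces.data, faces.target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)

model = make_pipeline(
    StandardScaler(),                        # preprocessing: normalize pixel features
    PCA(n_components=100, whiten=True),      # feature extraction: project to 100 components
    SVC(kernel="rbf", C=10, gamma="scale"),  # model training: SVM on the PCA features
)
model.fit(X_tr, y_tr)

# Validation/testing: accuracy, precision, recall, F1 per identity.
print(classification_report(y_te, model.predict(X_te), target_names=faces.target_names))
```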
❖ Face Recognition Strengths and Weaknesses:
o Strengths:
▪ Convenient and User-friendly.
▪ High Accuracy.
▪ Supports a wide range of applications.
▪ Scalability: can be easily scaled to accommodate large databases.
o Weaknesses:
▪ Privacy Concerns: related to the collection and use of biometric data
▪ Bias and Accuracy Issues: may exhibit biases and inaccuracies related to, e.g., skin color or gender.
▪ Security Risks: vulnerable to security risks such as spoofing attacks.
▪ Legal and Ethical Concerns: including compliance with data protection
regulations.
▪ Reliance on Quality Data: depend heavily on the quality and diversity of
the training data.
❖ Face Recognition Applications:
o Access Control: Securing physical spaces or digital resources by granting access
based on facial recognition.
o Surveillance: Monitoring and identifying individuals in public spaces for security
or law enforcement purposes.
o Authentication: Verifying identities for user access to devices, applications, or
online services.
o Healthcare: Patient identification and monitoring in medical settings.