Fundamentals of ML 1
• Taxonomy of ML
• Supervised learning
• Unsupervised learning
• Reinforcement learning
• ML development frameworks and toolkits
• ML deployment approaches and toolkits
• Common ML applications
• Sample ML deployment approaches
What is ML?
• ML is an iterative process driven by data.
• ML has the ability to learn from data and create representative models that can be used for prediction purposes.
• Once set up and configured, an ML system can continue to "self-learn" without any additional programming.
[Figure: traditional software development workflow – analyse problem → specify system → code system → test system → go live & assess outcome, with domain rules driving the traditional software]
• Historically, the seeds of ML were sown in the late 1950s, when an IBM engineer, Arthur Samuel, developed computer systems for enhanced pattern recognition. He established basic rules and concepts for machine learning, which became the precursor to later Artificial Intelligence systems.
• In 2006, Geoffrey Hinton published an academic paper captioned with the words "Deep Learning", which described an attempt at handwriting recognition using an automated system fashioned after the human cerebral cortex, with its neurons and synapses. The automated system was termed a deep neural network.
Taxonomy of Machine Learning
[Figure: taxonomy of machine learning techniques – supervised learning and unsupervised learning]
• Supervised learning requires labelled training and test datasets that have known independent X-variables (features) and known output Y-variable values.
• ML algorithms then determine how the X and Y variables are best related, thereby creating a trained prediction model, which can then be applied to real, non-labelled datasets to predict Y-variable values from the inputs.
• The prediction error provides direct feedback to the model, which is adjusted to decrease that error when predicting outcomes/future values.
• Key characteristics: labelled data, direct feedback, prediction of an outcome/future value.
[Figure: supervised learning – inputs and targets (teacher) are fed to the learning algorithm, which produces an output; the error between output and target is used as feedback]
Supervised Learning
• Typical applications (see the sketch below):
• Regression – the process of predicting a continuous, numerical value for an input data sample, e.g. house price or temperature forecasting.
• Classification – the process of assigning a category to an input data sample, e.g. predicting whether a person is ill or not, detecting fraudulent transactions, face classification.
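A minimal sketch of both applications using scikit-learn (an assumed toolkit for this example; the data is synthetic and purely illustrative):

```python
# Minimal supervised-learning sketch: regression and classification with scikit-learn.
# The data below is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Regression: predict a continuous value (e.g. a price) from one feature.
X_reg = rng.uniform(50, 200, size=(100, 1))           # feature, e.g. floor area
y_reg = 3.0 * X_reg.ravel() + rng.normal(0, 10, 100)  # continuous target
reg_model = LinearRegression().fit(X_reg, y_reg)
print("Predicted value for x=120:", reg_model.predict([[120]]))

# Classification: assign a category (0/1) to an input sample.
X_clf = rng.normal(size=(100, 2))                     # two features
y_clf = (X_clf[:, 0] + X_clf[:, 1] > 0).astype(int)   # binary label
clf_model = LogisticRegression().fit(X_clf, y_clf)
print("Predicted class for [0.5, 0.5]:", clf_model.predict([[0.5, 0.5]]))
```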
Supervised Learning
• ML Generalisations
• ML systems that have been trained on sample data are able to make viable predictions when out in the
field. This predictive behaviour is termed generalisation.
• There are two primary ways that generalisation may be carried out (contrasted in the sketch below):
• Instance-based generalisation
• The ML system uses the training data to identify known instances and memorise their "patterns". When the ML model is then used out in the field with real data, it compares the patterns of the new data with those it has memorised, allowing it to make predictions.
• Model-based generalisation
• The ML system builds a generalisable, extensible model from the training data, and then uses this model to make predictions when used in the field.
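A minimal sketch contrasting the two styles, assuming scikit-learn as the toolkit: k-nearest neighbours is instance-based (predictions come from stored training samples), while linear regression is model-based (training data is compressed into learned coefficients).

```python
# Instance-based vs model-based generalisation, sketched with scikit-learn.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.5 * X.ravel() + np.random.default_rng(1).normal(0, 1, 50)

# Instance-based: predictions come from comparing new inputs with memorised training instances.
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)

# Model-based: training data is summarised into a small set of learned parameters.
lin = LinearRegression().fit(X, y)

print("kNN prediction at x=4.2:", knn.predict([[4.2]]))
print("Linear-model prediction at x=4.2:", lin.predict([[4.2]]))
print("Learned coefficients:", lin.coef_, lin.intercept_)
```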
Reinforcement Learning
• Reinforcement learning involves an agent observing the state of its environment based on the information it receives; this information is stored and then used to choose the right action.
• Training an agent is a process of trial and error in various situations, in which the agent learns from receiving a reward that can be either positive or negative. The reward is feedback that the agent can use to update its parameters.
• Basic reinforcement learning is modelled as a Markov Decision Process (MDP).
• The optimisation goal can be set in many ways depending on the reinforcement learning approach, e.g. based on a value function, a policy gradient, or an environment model.
• Optimisation algorithms include Q-Learning, Deep Adversarial Networks, etc. (a tabular Q-learning update is sketched below).
• Key characteristics: decision process, reward system, learning a series of actions.
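A minimal, self-contained sketch of tabular Q-learning on a tiny made-up chain environment (the environment, reward values and hyperparameters here are illustrative assumptions, not from the slides):

```python
# Tabular Q-learning on a tiny illustrative chain environment.
# States 0..4; action 0 = move left, action 1 = move right; reaching state 4 gives reward +1.
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(state, action):
    """Deterministic transition: move left/right along the chain."""
    next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    done = next_state == n_states - 1
    return next_state, reward, done

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy action selection: explore occasionally, otherwise exploit.
        action = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q(s,a) toward reward + gamma * max_a' Q(s',a').
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

print(Q)  # the learned values favour moving right, toward the rewarding terminal state
```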
Reinforcement Learning
[Figure: learning paradigms compared by teacher and student roles (human vs machine) – supervised, unsupervised, semi-supervised and reinforcement learning]
Data Sourcing
• ML data analysis uses algorithms that continuously improve over time, but quality data is necessary for these models to operate efficiently.
• For ML models to understand how to perform various actions, training datasets must first be fed into the ML
algorithm, followed by validation datasets (or testing datasets) to ensure that the model is interpreting this
data accurately.
• Where possible, data sources must be reliable and valid. The sampling strategy used to generate samples for training models must be based on prior analyses by domain or subject experts. The sample size should also follow usual rules to ensure representativeness and minimise sampling error. As a rough rule of thumb, the sample size used for training an ML model should ideally be at least equal to the square root of the population size (illustrated below).
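A quick illustration of that rule of thumb (the population figure is hypothetical):

```python
# Illustrating the slide's rule of thumb: minimum training-sample size ~ sqrt(population size).
import math

population_size = 250_000                        # hypothetical population of records
min_sample_size = math.ceil(math.sqrt(population_size))
print(min_sample_size)                           # 500 records as a lower bound, per the rule of thumb
```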
• Where do data come from? (See the loading sketch after this list.)
• Internal sources: already collected by, or part of, the overall data collection of your organisation.
• For example: business-centric data that is available in the organisation's database to record day-to-day operations; scientific or experimental data.
• Existing external sources: available in ready-to-read format from an outside source, for free or for a fee.
• For example: public government databases, stock market data, Yelp reviews, [your favourite sport]-reference.
• External sources requiring collection efforts: available from an external source, but acquisition requires special processing.
• For example: data appearing only in print form, or data on websites.
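A minimal sketch of loading data from an internal file and an external source with pandas (the file path and URL below are placeholders, not real resources):

```python
# Loading data from different sources with pandas (paths/URLs below are placeholders).
import pandas as pd

# Internal source: a file already collected by the organisation.
internal_df = pd.read_csv("data/daily_operations.csv")

# Existing external source: a dataset published on the web in a ready-to-read format.
external_df = pd.read_csv("https://example.org/open-data/stock_prices.csv")

print(internal_df.shape, external_df.shape)
```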
Data Sourcing
Identify data types and properties (see the inspection sketch below):
1. Is it structured, semi-structured or unstructured data?
2. What is the typical size of the datasets?
3. Is the data/information needed in real time or near real time?
It is advisable to characterise the data using a well-defined framework, e.g. a data V-characteristics framework.
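A small sketch of checking those properties on a dataset with pandas (the file path is a placeholder):

```python
# Inspecting data types and size of a dataset with pandas (file path is a placeholder).
import pandas as pd

df = pd.read_csv("data/daily_operations.csv")
print(df.dtypes)                                    # column types: structured/tabular data
print(df.shape)                                     # number of rows and columns
print(df.memory_usage(deep=True).sum(), "bytes")    # rough in-memory size of the dataset
```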
Data Confidentiality & Privacy
• There appears to be an inherent contradiction between the aims of ML and those of ethics and data privacy. ML often seeks to uncover hidden data or information, often pertaining to clients, whilst data privacy seeks to protect client data from intrusive actions.
• Out of this apparent paradox has come the notion of privacy-preserving ML (PPML), which seeks to achieve both seemingly contradictory objectives.
Source:
https://arxiv.org/ftp/arxiv/papers/1804/1804.11238.pdf
Data Confidentiality & Privacy
[Figure: taxonomy of privacy-preserving ML (PPML) techniques]
• Data perturbation: value distortion (additive perturbation, multiplicative perturbation, other randomisation techniques), data microaggregation, data anonymisation, data swapping
• Data hiding: sampling method, probability distribution, analytical method
• Rule hiding: association rule hiding, data blocking
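As one concrete illustration, additive perturbation (from the value-distortion branch above) can be sketched as adding random noise to sensitive numeric values before they are used for modelling; the data and noise scale below are illustrative assumptions:

```python
# Additive perturbation: mask sensitive numeric values by adding random noise
# before the data is shared or used for model training (values are illustrative).
import numpy as np

rng = np.random.default_rng(42)
salaries = np.array([48_000, 52_500, 61_000, 75_250, 90_000], dtype=float)  # sensitive values

noise = rng.normal(loc=0.0, scale=2_000.0, size=salaries.shape)  # zero-mean Gaussian noise
perturbed = salaries + noise

print(perturbed)                           # individual values are distorted...
print(salaries.mean(), perturbed.mean())   # ...but aggregate statistics remain approximately useful
```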
Data Quality
• In real-life scenarios, data that is obtained for use in developing and working with ML systems is often imperfect. This may manifest as missing, duplicate or incorrectly entered data values.
• Data is at the core of modern ML-based systems, which depend on it to derive their predictive power. Because of this, all ML projects are dependent on high data quality.
• There are numerous data quality issues that can threaten to derail ML systems; the following need to be considered and prevented before problems arise:
• Inaccurate, incomplete and improperly labelled data
• Having too much data
• Having too little data
• Biased data
• Unbalanced data
• Inconsistent data
• Data sparsity
• Data labelling issues
• For most ML applications, therefore, the first step after ingesting data is to carry out data cleansing (also termed data scrubbing or data cleaning); a small cleansing sketch follows.
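A minimal data-cleansing sketch with pandas (the DataFrame here is a small made-up example):

```python
# Basic data cleansing with pandas: handle missing, duplicate and badly entered values.
import pandas as pd

df = pd.DataFrame({
    "age":    [34, 34, None, 29, 240],          # a missing value and an implausible entry
    "income": [52000, 52000, 48000, None, 61000],
})

df = df.drop_duplicates()                        # remove duplicate records
df["age"] = df["age"].where(df["age"] < 120)     # treat impossible ages as missing
df["age"] = df["age"].fillna(df["age"].median())         # impute missing ages
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```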
Data Dimensionality Considerations
• Dimensionality in statistics refers to how many attributes a dataset has. In other words, it refers
to the number of features (X-input variables) that are deemed relevant to creating the ML model.
• For example, healthcare data is notorious for having vast numbers of variables (e.g. blood pressure, weight, cholesterol level), and ideally only the most relevant of these would be used for modelling.
• In practice, selecting the relevant variables is difficult to do, in part because many variables are inter-related (like weight and blood pressure).
• With high-dimensional data, the number of features can exceed the number of observations. Moreover, working with too many dimensions increases the likelihood of overfitting the model, which generally reduces the predictive performance with real/live data in the field.
Data Dimensionality Considerations
• For computational efficiency, we want to use only the data dimensions that are directly relevant
to the model to be built, to ensure that meaningful patterns are found.
• For this reason, ML systems may need a pre-processing step that transforms the data into a form with a smaller number of dimensions (termed dimensionality reduction, also described as descending dimensions or dimensional scaling).
• The transformation process typically combines the information contained in many dimensions, into a
smaller number of dimensions.
• This allows a minimal number of dimensions to be used for ML modelling such that the behaviour of the
transformed dataset is close enough to that derived from the original high-dimension dataset.
• This reduces ML computational overheads and makes it easier to both discern meaningful patterns and
visualize the data.
• Later on in the module we will discuss techniques for reducing (descending) dimensions, such as the Principal Component Analysis (PCA) technique – this is also known as "General Factor Analysis". A PCA sketch follows below.
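A minimal sketch of dimensionality reduction with PCA in scikit-learn (the synthetic 10-dimensional data is an illustrative assumption):

```python
# Dimensionality reduction with PCA: project correlated high-dimensional data onto fewer dimensions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))                 # 3 underlying latent factors
X = base @ rng.normal(size=(3, 10))              # 200 samples, 10 correlated features
X += rng.normal(scale=0.05, size=X.shape)        # small measurement noise

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)                 # 200 samples, now only 3 dimensions
print(X_reduced.shape)
print(pca.explained_variance_ratio_)             # most variance captured by the first components
```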
ML Deployment Approaches and Toolkits
Common ML Applications
Source: Martinez-Plumed, F. et al. (2020) ‘CRISP-DM Twenty Years Later: From Data Mining Processes to Data
Science Trajectories’, IEEE Transactions on Knowledge and Data Engineering, 4347(c), pp. 1–1. doi:
10.1109/TKDE.2019.2962680
ML Deployment Approach
Source:
https://thinkinsights.net/digital/crisp-dm/
Another ML Deployment Approach
Source: Martinez-Plumed, F. et al. (2020) ‘CRISP-DM Twenty Years Later: From Data Mining
Processes to Data Science Trajectories’, IEEE Transactions on Knowledge and Data
Engineering, 4347(c), pp. 1–1. doi: 10.1109/TKDE.2019.2962680
ML System Development Toolkits
• In this module, we will be using some essential software packages to develop ML systems, covering tasks such as (a small end-to-end sketch follows this list):
• Loading data from a variety of sources
• Manipulating data, typically as data frames
• Performing statistical analysis
• Visualising data
• Implementing ML models
• Evaluating models
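A compact end-to-end sketch of those tasks using pandas, scikit-learn and matplotlib (the dataset is generated in-line so the example stays self-contained; the module's own exercises may use different data):

```python
# End-to-end sketch: load/manipulate data, do simple statistics, fit and evaluate a model, and plot.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# "Load" data: here a synthetic frame stands in for pd.read_csv(...) on a real source.
rng = np.random.default_rng(0)
df = pd.DataFrame({"hours_studied": rng.uniform(0, 10, 120)})
df["exam_score"] = 40 + 5 * df["hours_studied"] + rng.normal(0, 5, 120)

print(df.describe())                                    # basic statistical analysis

X_train, X_test, y_train, y_test = train_test_split(
    df[["hours_studied"]], df["exam_score"], test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)        # implement the ML model
preds = model.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, preds))   # evaluate the model

plt.scatter(X_test["hours_studied"], y_test, label="actual")     # visualise the data
plt.scatter(X_test["hours_studied"], preds, label="predicted")
plt.xlabel("hours studied"); plt.ylabel("exam score"); plt.legend()
plt.show()
```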
ML System Development Toolkits
• Anaconda creates a virtual environment within an OS to allow a Python workflow.
• Jupyter Notebook is the integrated development environment (IDE) for Python offered as part of the Anaconda software. It provides an interactive Python command shell (based on the shell, web browser and application interface), with graphical integration, customisable commands, rich history (in JSON format), and computational parallelism for enhanced performance.
• Python is the programming language framework, which is commonly used for ML, AI, data science etc.
• NumPy is the true analytical workhorse of the Python language. It provides the user with multidimensional arrays, along with a large set of functions to perform a multiplicity of mathematical operations on these arrays. Arrays are blocks of data arranged along multiple dimensions, which implement mathematical vectors and matrices. Characterised by optimal memory allocation, arrays are useful not just for storing data, but also for fast matrix operations (vectorisation), which are indispensable when you wish to solve ad hoc data science problems. (A short NumPy sketch follows.)
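A short sketch of NumPy's multidimensional arrays and vectorised operations:

```python
# NumPy arrays and vectorisation: whole-array maths instead of explicit Python loops.
import numpy as np

matrix = np.array([[1.0, 2.0], [3.0, 4.0]])   # a 2x2 array acting as a mathematical matrix
vector = np.array([0.5, -1.0])

print(matrix @ vector)          # matrix-vector product
print(matrix * 10)              # element-wise operation applied to every entry at once
print(matrix.mean(axis=0))      # column means via a single vectorised call
```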
ML System Development Toolkits
• Matplotlib is a library that contains all the building blocks required to create quality plots from arrays and to visualise them interactively. It offers a variety of graph types, e.g. bar charts, histograms, X-Y line graphs, pie charts, scatter plots etc., which can be generated and customised using a small set of commands.
• Seaborn is a high-level visualisation package based on matplotlib and integrated with pandas data structures (such as Series and DataFrames), capable of producing informative and beautiful statistical visualisations. (A plotting sketch follows.)
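A minimal plotting sketch with matplotlib and seaborn (the data is synthetic and purely illustrative):

```python
# Quick visualisation with matplotlib and seaborn on a small synthetic dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 2 * df["x"] + rng.normal(scale=0.5, size=200)

plt.hist(df["x"], bins=20)               # matplotlib: basic histogram from an array/Series
plt.title("Distribution of x")
plt.show()

sns.scatterplot(data=df, x="x", y="y")   # seaborn: pandas-aware statistical plot
plt.show()
```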