
CP70066E Machine Learning

Machine Learning Fundamental

Professor Jonathan Loo


Chair in Computing and Engineering
School of Engineering and Computing
University of West London
Lesson Outline

• Machine learning concepts and principles
  • What is ML?
  • Historical development of ML
  • Key concepts underpinning ML
• Taxonomy of ML
  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning
• ML considerations and issues
  • Data sources
  • Data confidentiality & privacy
  • Data quality issues
  • Data dimensionality considerations
• ML deployment approaches and toolkits
  • Common ML applications
  • Sample ML deployment approaches
  • ML development frameworks and toolkits
Introduction to Machine Learning

What is ML?

• Artificial Intelligence (AI) is the design of software applications which exhibit human-like behaviour, e.g. speech, natural language processing, reasoning or intuition.

• Machine Learning (ML) supports AI; it is the discipline, based on pattern recognition and learning theory, which explores algorithms that can learn to make predictions.

[Diagram: nested fields, from outermost to innermost: Artificial Intelligence, Machine Learning, Representation Learning, Deep Learning]
What is ML?

• ML is an iterative process driven by data.
• ML has the ability to learn from data and create representative models that can be used for prediction purposes.
• Once set up and configured, an ML system can continue "self-learning" without any additional programming.

Traditional programmed system:
Analyse problem domain → Specify system outcome → Encode rules → Code system → Test system → Go live & assess system
In traditional software, we discover rules/patterns/models manually and encode them with a programming language.

ML system:
Analyse problem domain → Specify system outcome → Build ML system → Train (training data) → Test (test data) → Supervise → ML system goes live & continues learning in the field
An ML system discovers rules/patterns/models automatically by learning from examples. After training it produces a model that "knows" these patterns, but we still need to supervise it to make sure the model is correct.
Historical development of ML

• Historically, the seeds of ML were sown in the late 1950s, when an IBM engineer, Arthur Samuel, developed computer systems for enhanced pattern recognition. He established basic rules and concepts for machine learning which became the precursor to later Artificial Intelligence systems.

• In 2006, Geoffrey Hinton published an academic paper captioned with the words "Deep Learning", which described an attempt at handwriting recognition using an automated system fashioned after the human cerebral cortex with its neurons and synapses. The automated system was termed a deep neural network.
Taxonomy of Machine Learning

Machine learning techniques fall into three branches:

• Supervised learning: concerned with classified (labelled) data
• Unsupervised learning: concerned with unclassified (unlabelled) data
• Reinforcement learning: no pre-labelled data; the system learns from interaction with an environment
Taxonomy of Machine Learning

Supervised learning
• Classification: logistic regression, k-nearest neighbours, Support Vector Machines, decision trees, ensemble methods (Random Forest)
• Regression: linear regression, multiple regression, polynomial regression

Unsupervised learning
• Clustering: k-means clustering, hierarchical clustering, density-based clustering
Supervised Learning

• Supervised learning requires labelled training and test datasets that have known independent X-variables (features) and known output Y-variable values.
• ML algorithms then determine how the X and Y variables are best related, thereby creating a trained prediction model, which can then be applied to real, non-labelled datasets to predict Y-variable values from the inputs.
• The error is the direct feedback to the model, used to improve its predictions of the outcome/future by progressively decreasing the error.

[Diagram: Input → Supervised Learning → Output, with targets (the "teacher") supplying the error signal. Labelled data; direct feedback; predicts the outcome/future.]
Supervised Learning

• Typical applications (see the sketch below):
  • Regression: the process of predicting a continuous, numerical value for an input data sample, e.g. house prices, temperature forecasting.
  • Classification: the process of assigning a category to an input data sample, e.g. predicting whether a person is ill or not, detecting fraudulent transactions, face classification.
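
A minimal sketch of both application types, assuming scikit-learn and NumPy (both introduced later in this lesson) and using purely synthetic data:

```python
# Minimal sketch of supervised learning with scikit-learn (assumed installed).
# The dataset here is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Regression: predict a continuous value (e.g. a house price) from features.
X = rng.normal(size=(200, 3))                  # 200 samples, 3 features (X-variables)
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)  # known Y-variable
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_train, y_train)  # learn how X and y are related
print("regression R^2 on test data:", reg.score(X_test, y_test))

# Classification: assign a category (e.g. ill / not ill) to an input sample.
labels = (X[:, 0] + X[:, 1] > 0).astype(int)    # labelled data (the "teacher")
X_train, X_test, l_train, l_test = train_test_split(X, labels, random_state=0)
clf = LogisticRegression().fit(X_train, l_train)
print("classification accuracy on test data:", clf.score(X_test, l_test))
```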
Supervised Learning

• ML generalisations
  • ML systems that have been trained on sample data are able to make viable predictions when out in the field. This predictive behaviour is termed generalisation.

• There are two primary ways that generalisation may be carried out (contrasted in the sketch after this list):
  • Instance-based generalisation
    • The ML system uses the training data to identify known instances and memorise their "patterns". When the ML model is then used out in the field with real data, it compares new patterns with what has been memorised, to allow it to make predictions.
  • Model-based generalisation
    • The ML system builds a generalisable, extensible model from the training data, and then uses this model to make predictions when used in the field.
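
An illustrative sketch of the contrast, assuming scikit-learn and synthetic data: k-nearest neighbours is an instance-based learner (it predicts by comparing against memorised training instances), while logistic regression is model-based (it predicts from fitted parameters). The choice of these two algorithms is ours, for illustration:

```python
# Sketch contrasting the two generalisation styles (synthetic data, scikit-learn assumed).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier   # instance-based: memorises examples
from sklearn.linear_model import LogisticRegression  # model-based: fits a general model

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > X[:, 1]).astype(int)

# Instance-based: prediction compares new points against memorised training instances.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Model-based: prediction uses the learned parameters, not the stored training data.
lr = LogisticRegression().fit(X, y)

new_point = np.array([[0.5, -0.2]])
print("instance-based prediction:", knn.predict(new_point))
print("model-based prediction:   ", lr.predict(new_point))
```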
Unsupervised Learning

• An unsupervised learning system tries to identify patterns within non-labelled data, allowing a model to be constructed for predicting Y-variable values given a known set of independent X-variables (features).

• In other words, an unsupervised learning system attempts to group similar data points within the dataset based on previously set criteria. Each grouping points to a Y-value indication. The grouping or clustering rules are used to create a prediction model.

[Diagram: Input → Unsupervised Learning → Output. No labels; no feedback; finds hidden structure in the data.]
Unsupervised Learning

• Typical application areas:
  • Pattern recognition and clustering: the process of dividing and grouping similar data samples together, e.g. user-base segmentation, denoising.
  • Density estimation: estimating the probability distribution that underlies the data.
  • Denoising and compression (dimensionality reduction): the process of compressing features into so-called principal values which convey similar information more concisely.
  • Synthesis and sampling.
Unsupervised Learning

• In market analysis, the clustering rules can be used to segment consumers based on a variety of demographic and lifestyle factors. Once the ML model determines which segment to place a consumer in, it is then able to predict their buying behaviour (see the sketch below).

• Unsupervised learning systems are useful when exploring datasets for the purpose of characterising them, e.g. scrutinising datasets to detect unknown anomalies or relationships.
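
A hedged sketch of this idea, assuming scikit-learn's k-means implementation and invented demographic features (age and annual spend):

```python
# Illustrative sketch: clustering synthetic customer data into market segments
# with k-means (scikit-learn assumed; features and segment count are made up).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Hypothetical features per consumer: [age, annual spend in £k]
customers = np.vstack([
    rng.normal([25, 5], [3, 1], size=(50, 2)),   # younger, lower-spend group
    rng.normal([45, 20], [4, 3], size=(50, 2)),  # middle-aged, higher-spend group
    rng.normal([65, 10], [3, 2], size=(50, 2)),  # older, mid-spend group
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print("segment sizes:", np.bincount(kmeans.labels_))
# A new consumer is assigned to a segment, from which buying behaviour is predicted.
print("segment for a new consumer:", kmeans.predict([[30, 6]]))
```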
Reinforcement Learning

• Reinforcement learning involves an agent observing the state of an environment based on the information it receives from it; this information is stored and then used for choosing the right action.

• Training an agent is a process of trial and error in various situations: the agent learns from receiving a reward that can be either positive or negative. The reward is feedback that the agent can use to update its parameters.

• Basic reinforcement learning is modelled as a Markov Decision Process (MDP).

• The optimisation goal can be set in many ways depending on the reinforcement learning approach, e.g. based on a value function, a policy gradient, or an environment model.

• Optimisation algorithms include Q-Learning (sketched below), Deep Adversarial Networks, etc.

• In summary: a decision process; a reward system; learning a series of actions.
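
A minimal tabular Q-learning sketch on a toy corridor environment makes the trial-and-error reward loop concrete. The environment, rewards and hyperparameters below are invented for illustration:

```python
# Minimal tabular Q-learning sketch on a toy 1-D corridor (illustrative only;
# the states, rewards and hyperparameters are made up, not from the lecture).
import numpy as np

n_states, n_actions = 5, 2          # positions 0..4; actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = np.zeros((n_states, n_actions))  # value of each (state, action) pair
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0                            # the agent starts at the left end
    while s != n_states - 1:         # reaching the right end terminates the episode
        # epsilon-greedy: mostly exploit the best known action, sometimes explore
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0   # positive reward at the goal
        # Q-learning update: move Q[s, a] towards reward + discounted best future value
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print("learned policy (0=left, 1=right):", np.argmax(Q, axis=1))
```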
Reinforcement Learning

AlphaGo Zero: How and Why it Works
http://tim.hibal.org/blog/alpha-zero-how-and-why-it-works/comment-page-3/
Human and Machine Involvement in Machine Learning

Who acts as teacher, and who as student?

• Supervised learning: Human (teacher) → Machine (student)
• Unsupervised learning: Machine only
• Semi-supervised learning: Human + Machine
• Reinforcement learning: Human + Machine
Data Source
Data Sourcing

• ML data analysis uses algorithms that continuously improve over time, but quality data is necessary for these models to operate efficiently.
• For ML models to understand how to perform various actions, training datasets must first be fed into the ML algorithm, followed by validation datasets (or testing datasets) to ensure that the model is interpreting the data accurately.
• Where possible, data sources must be reliable and valid. The sampling strategy used to generate samples for training models must be based on prior analyses by domain or subject experts. The sample size should also follow the usual rules to ensure representativeness and minimise sampling error. As a rough rule of thumb, the sample size used for training an ML model should ideally be at least equal to the square root of the population size.
• Where do data come from? (A loading sketch follows this list.)
  • Internal sources: already collected by, or part of, the overall data collection of your organisation.
    • For example: business-centric data held in the organisation's database to record day-to-day operations; scientific or experimental data.
  • Existing external sources: available in a ready-to-read format from an outside source, for free or for a fee.
    • For example: public government databases, stock market data, Yelp reviews, [your favourite sport]-reference.
  • External sources requiring collection effort: available from an external source, but acquisition requires special processing.
    • For example: data appearing only in print form, or data on websites.
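
An illustrative sketch of these three routes using pandas (introduced later in this lesson); every file name and URL below is a hypothetical placeholder:

```python
# Hedged sketch of sourcing data with pandas (file names/URLs are hypothetical).
import pandas as pd

# Internal source: e.g. day-to-day operational records exported from a database.
internal = pd.read_csv("internal_operations.csv")              # hypothetical file

# Existing external source: many public datasets are published as CSV over HTTP.
external = pd.read_csv("https://example.org/public_data.csv")  # hypothetical URL

# External source requiring collection effort: e.g. tables scraped from a website.
tables = pd.read_html("https://example.org/report.html")       # hypothetical URL
print(internal.shape, external.shape, len(tables))
```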
Data Sourcing

Plan item: Identification of data sources, channels & sinks
  Factors to be considered:
    1. Data sources
    2. Target audiences
    3. Data transfer channels and intermediate conduits
  Additional/relevant issues: It is often advisable to carry out a data-stakeholder mapping exercise, to determine the data needs of key organisational stakeholders and their relative importance to the company.

Plan item: Identification of data types and properties
  Factors to be considered:
    1. Is it structured, semi-structured or unstructured data?
    2. What is the typical size of the datasets?
    3. Is data/information needed in real time or near-real time?
  Additional/relevant issues: It is advisable to characterise the data using a well-defined framework, e.g. the data V-characteristics framework.
Data Confidentiality & Privacy

• There appears to be an inherent contradiction between the aims of ML and those of ethics and data privacy. ML often seeks to uncover hidden data or information, often pertaining to clients, whilst data privacy seeks to protect client data from intrusive actions.
• Out of this apparent paradox has come the notion of privacy-preserving ML (PPML), which seeks to achieve both seemingly contradictory objectives.

Source: https://arxiv.org/ftp/arxiv/papers/1804/1804.11238.pdf
Data Confidentiality & Privacy
A taxonomy of PPML techniques:

• Data hiding
  • Value distortion
    • Data perturbation: additive perturbation, multiplicative perturbation, data microaggregation, data anonymisation, data swapping, other randomisation techniques
    • Probability distribution: sampling method, analytical method
  • Secure Multi-Party Computation (SMC)
• Rule hiding
  • Association rule hiding: data perturbation, data blocking
  • Classification rule hiding
Data Quality Issues

• In real-life scenarios, the data obtained for use in developing and working with ML systems is often imperfect. This may manifest as missing, duplicate or incorrectly entered data values.
• Data is at the core of modern ML-based systems, which depend on it to derive their predictive power. Because of this, all ML projects are dependent on high data quality.
• There are numerous data quality issues that can threaten to derail ML systems; the following need to be considered and prevented before issues arise:
  • Inaccurate, incomplete and improperly labelled data
  • Having too much data
  • Having too little data
  • Biased data
  • Unbalanced data
  • Inconsistent data
  • Data sparsity
  • Data labelling issues

• For most ML applications, therefore, the first step after ingesting data is to carry out data cleansing (also termed data scrubbing or data cleaning), sketched below.
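
A minimal data-cleansing sketch with pandas, using an invented toy table; real pipelines require far more care:

```python
# Minimal data-cleansing sketch with pandas (toy data; purely illustrative).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 25, np.nan, 40, 200],       # missing and incorrectly entered values
    "income": [30000, 30000, 42000, None, 55000],
})

df = df.drop_duplicates()                      # remove duplicate records
df = df[df["age"].between(0, 120)]             # drop implausible or missing ages (NaN compares False)
df["income"] = df["income"].fillna(df["income"].median())  # impute remaining missing values
print(df)
```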
Data Dimensionality Considerations

• Dimensionality in statistics refers to how many attributes a dataset has. In other words, it refers to the number of features (X-input variables) that are deemed relevant to creating the ML model.

  • For example, healthcare data is notorious for having vast numbers of variables (e.g. blood pressure, weight, cholesterol level).
  • In practice, selecting only the relevant variables is difficult to do, in part because many variables are inter-related (like weight and blood pressure).

• With high-dimensional data, the number of features can exceed the number of observations. Moreover, working with too many dimensions increases the likelihood of overfitting the model, which generally reduces predictive performance with real/live data in the field.
Data Dimensionality Considerations

• A well-known term is the Curse of Dimensionality, which is exemplified by the classic peaking graph:
  • As the dimensionality increases, the ML classifier's performance increases until the optimal number of features is reached.
  • Further increasing the dimensionality without increasing the number of training samples results in a decrease in classifier performance.
Data Dimensionality Considerations

• For computational efficiency, we want to use only the data dimensions that are directly relevant to the model to be built, to ensure that meaningful patterns are found.
• For this reason, ML systems may need a pre-processing step that transforms the data into one with a smaller number of dimensions (termed descending dimensions or dimensional scaling).
  • The transformation process typically combines the information contained in many dimensions into a smaller number of dimensions.
  • This allows a minimal number of dimensions to be used for ML modelling, such that the behaviour of the transformed dataset is close enough to that derived from the original high-dimension dataset.
  • This reduces ML computational overheads and makes it easier to both discern meaningful patterns and visualise the data.

• Later in the module we will discuss techniques for reducing or descending dimensions, such as Principal Component Analysis (PCA), also known as "General Factor Analysis"; a preview sketch follows.
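
As a preview, a minimal PCA sketch with scikit-learn, using synthetic data in which ten observed features are driven by two underlying dimensions:

```python
# Sketch of descending dimensions with PCA (scikit-learn assumed; data synthetic).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
base = rng.normal(size=(100, 2))
# 10 observed features, but most are noisy combinations of 2 underlying dimensions.
X = base @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(100, 10))

pca = PCA(n_components=2)              # keep a minimal number of dimensions
X_small = pca.fit_transform(X)         # transformed, lower-dimensional dataset
print("original shape:", X.shape, "-> reduced shape:", X_small.shape)
print("variance explained:", pca.explained_variance_ratio_.sum())
```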
ML Deployment Approaches and Toolkits
Common ML Applications

• ML has been applied to solving problems in areas such as:
  • Financial analyses, consumer credit scoring and automated trading systems
  • Energy and pollution modelling/forecasting
  • Market segmentation and analysis
  • Feature/facial recognition, image processing and computer vision
  • Medical diagnoses, cancer detection, drug synthesis and DNA sequencing
  • Self-driving vehicles
  • Automated manufacturing systems
  • Natural language processing
Standard ML Development Stages

Source: Martinez-Plumed, F. et al. (2020) 'CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories', IEEE Transactions on Knowledge and Data Engineering. doi: 10.1109/TKDE.2019.2962680
ML Deployment Approach

• The Cross-Industry Standard Process for Data Mining, known as CRISP-DM, is an open standard process model that describes common approaches used by ML experts.
• CRISP-DM breaks the process of data mining into six major phases: business understanding, data understanding, data preparation, modelling, evaluation and deployment.
• It is the most widely used model for deploying ML systems.

Source: https://thinkinsights.net/digital/crisp-dm/
Another ML Deployment Approach

Source: Martinez-Plumed, F. et al. (2020) 'CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories', IEEE Transactions on Knowledge and Data Engineering. doi: 10.1109/TKDE.2019.2962680
Another ML Deployment Approach

Source: Based on an extension to CRISP-DM proposed by Huber et al. (2019), known as DMME (Data Mining Methodology for Engineering Applications).
ML System Development Toolkits

• In this module, we will be using some essential software packages to develop ML systems, covering tasks such as:
  • Loading data from a variety of sources
  • Manipulating data, typically as frames
  • Performing statistical analysis
  • Visualising data
  • Implementing ML models
  • Evaluating models
ML System Development Toolkits

• Anaconda creates a virtual environment within an OS to allow a Python workflow.

• Jupyter Notebook is the integrated development environment (IDE) for Python offered as part of the Anaconda software. It provides an interactive Python command shell (accessible via a terminal shell, web browser and application interface), with graphical integration, customisable commands, rich history (in JSON format), and computational parallelism for enhanced performance.

• Python is the programming language framework, commonly used for ML, AI, data science, etc.

• NumPy is the true analytical workhorse of the Python language. It provides the user with multidimensional arrays, along with a large set of functions to perform a multiplicity of mathematical operations on these arrays. Arrays are blocks of data arranged along multiple dimensions, which implement mathematical vectors and matrices. Characterised by optimal memory allocation, arrays are useful not just for storing data, but also for fast matrix operations (vectorisation), which are indispensable when you wish to solve ad hoc data science problems. A short example follows.
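
A short illustration of NumPy arrays and vectorised operations (the values are arbitrary):

```python
# Short illustration of NumPy arrays and vectorisation (values are arbitrary).
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])   # a 2-D array implementing a matrix
v = np.array([0.5, -1.0])                # a 1-D array implementing a vector

print(A @ v)           # matrix-vector product, computed without an explicit loop
print(A * 2 + 1)       # element-wise arithmetic applied across the whole array
print(A.mean(axis=0))  # column means: reductions along a chosen dimension
```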
ML System Development Toolkits

• The pandas package deals with everything that NumPy cannot do. Thanks to its specific data structures, namely DataFrames and Series, pandas allows you to handle complex tables of data of different types (which is something that NumPy's arrays cannot do) and time series. You will be able to easily and smoothly load data from a variety of sources. You can then slice, dice, handle missing elements, add, rename, aggregate, reshape, and finally visualise your data at will.

• Scikit-learn is the core of data science operations in Python. It offers all that you may need in terms of data pre-processing, supervised and unsupervised learning, model selection, validation, and error metrics.

• Matplotlib is a library that contains all the building blocks required to create quality plots from arrays and to visualise them interactively. It offers a variety of graph types, e.g. bar charts, histograms, X-Y line graphs, pie charts, scatter plots, etc., which can be generated and customised using a small set of commands.

• Seaborn is a high-level visualisation package based on matplotlib and integrated with pandas data structures (such as Series and DataFrames), capable of producing informative and beautiful statistical visualisations. A combined example follows.
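
A tiny sketch tying these packages together, with invented data:

```python
# Tiny sketch combining pandas, matplotlib and seaborn (synthetic data;
# all packages assumed installed).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({"x": np.linspace(0, 10, 50)})
df["y"] = 2 * df["x"] + np.random.default_rng(4).normal(scale=2, size=50)

sns.scatterplot(data=df, x="x", y="y")   # seaborn works directly on DataFrames
plt.title("Synthetic example")           # matplotlib supplies the building blocks
plt.show()
```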
ML System Development Toolkits

• A deep learning framework is an interface, library or tool which allows us to build deep learning models more easily and quickly, without getting into the details of the underlying algorithms.
• Frameworks provide a clear and concise way of defining models using a collection of pre-built and optimised components.
• Some key features of a good deep learning framework (a gradient example follows this list):
  • Easy to understand and code
  • Automatically computes gradients, i.e. backpropagation
  • Optimised for performance, such as fast GPU/CPU implementations of matrix multiplication, convolutions and backpropagation
  • Process parallelism to reduce computation time
  • Good community support and contributions, e.g. open-source code, models, pre-trained models, etc.
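
As one hedged illustration of automatic gradient computation, here using PyTorch, one example of such a framework (its use here is an assumption for illustration, not a module requirement):

```python
# Illustration of automatic differentiation (backpropagation) in PyTorch,
# one example of a deep learning framework (assumed installed; illustrative only).
import torch

x = torch.tensor(3.0, requires_grad=True)  # track operations on x
y = x ** 2 + 2 * x                         # build a computation graph
y.backward()                               # backpropagate to compute dy/dx
print(x.grad)                              # tensor(8.) since dy/dx = 2x + 2 = 8 at x = 3
```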
End of Lesson
