M.P.GEETHA,
Department of CSE,
Sri Ramakrishna Institute of Technology,
Coimbatore
DATA MINING
• Other Applications
– Text mining (news group, email, documents) and Web mining
– Stream data mining
– DNA and bio-data analysis
Market Analysis and Management
• Where does the data come from?
– Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
• Target marketing
– Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits,
etc.
– Determine customer purchasing patterns over time
• Cross-market analysis
– Associations/correlations between product sales, and prediction based on such associations
• Customer profiling
– What types of customers buy what products (clustering or classification)
• Sports
– IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat
• Astronomy
– JPL and the Palomar Observatory discovered 22 quasars with the
help of data mining
• Internet Web Surf-Aid
– IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
Steps in Data Mining
Steps of a KDD Process
• Learning the application domain
– relevant prior knowledge and goals of application
• Identifying a target data set: data selection
• Data preprocessing (see the sketch after this list)
– Data cleaning (remove noise and inconsistent data)
– Data integration (multiple data sources may be combined)
– Data selection (data relevant to the analysis task are retrieved from the database)
– Data transformation (data are transformed or consolidated into forms appropriate for mining)
• Data mining (an essential process where intelligent methods are applied to extract data patterns)
• Pattern evaluation (identify the truly interesting patterns)
• Knowledge presentation (mined knowledge is presented to the user with visualization or representation techniques)
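To make the preprocessing stages concrete, here is a minimal sketch in Python with pandas; the table and column names (store_id, price, quantity, region) are hypothetical.

```python
import pandas as pd

# Hypothetical raw extracts (all table and column names are assumptions)
sales = pd.DataFrame({"store_id": [1, 1, 2, 2, 3],
                      "price": [10.0, None, 8.0, -5.0, 12.0],
                      "quantity": [3, 2, 5, 1, 4]})
regions = pd.DataFrame({"store_id": [1, 2, 3],
                        "region": ["North", "South", "North"]})

# Data cleaning: remove noise and inconsistent data
clean = sales.dropna(subset=["price"])      # drop records with missing price
clean = clean[clean["price"] > 0]           # drop inconsistent (negative) prices

# Data integration: combine multiple data sources on a shared key
merged = clean.merge(regions, on="store_id", how="left")

# Data selection and transformation: consolidate into a form suitable for mining
merged["revenue"] = merged["price"] * merged["quantity"]
print(merged.groupby("region")["revenue"].sum())
```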
[Figure: architecture of a typical KDD system, from databases and a data warehouse at the base, through data exploration (statistical analysis, querying and reporting) and data mining, to pattern evaluation at the top.]
Data Mining: On What Kinds of Data?
• Relational database
• Data warehouse
• Transactional database
• Advanced database and information repository
– Object-relational database
– Spatial and temporal data
– Time-series data
– Stream data
– Multimedia database
– Heterogeneous and legacy database
– Text databases & WWW
Relational Databases
• DBMS – database management system, contains a collection of
interrelated databases
e.g. Faculty database, student database, publications database
• Each database contains a collection of tables and functions to manage and access
the data.
e.g. student_bio, student_graduation, student_parking
• Each table contains columns and rows, with columns as attributes of data and rows
as records.
• Tables can be used to represent the relationships between or among multiple tables.
Relational Databases (2) – AllElectronics store
Relational Databases (3)
• With a relational query language, e.g. SQL, we will be able to find answers
to questions such as:
– How many items were sold last year?
– Who has earned commissions higher than 10%?
– What were the total sales last month for Dell laptops?
• When data mining is applied to relational databases, we can search for trends
or data patterns.
• Relational databases are one of the most commonly available and rich
information repositories, and thus are a major data form in our study.
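For illustration, a small sketch using Python's sqlite3 module to pose the Dell-laptop question above; the sales schema and the inserted values are hypothetical.

```python
import sqlite3

# Hypothetical schema for illustration: sales(item, brand, amount, sale_date)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (item TEXT, brand TEXT, amount REAL, sale_date TEXT)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                [("laptop", "Dell", 1200.0, "2024-05-03"),
                 ("laptop", "HP",    950.0, "2024-05-10"),
                 ("mouse",  "Dell",   25.0, "2024-04-21")])

# "What were the total sales last month for Dell laptops?"
row = con.execute("""
    SELECT SUM(amount) FROM sales
    WHERE brand = 'Dell' AND item = 'laptop'
      AND sale_date BETWEEN '2024-05-01' AND '2024-05-31'
""").fetchone()
print(row[0])   # 1200.0
```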
Data Warehouses
• A repository of information collected from multiple sources, stored
under a unified schema, and that usually resides at a single site.
• Constructed via a process of data cleaning, data integration, data
transformation, data loading and periodic data refreshing.
Data Mining Functionalities
– What kinds of patterns can be mined?
• Prediction
– Predict missing or unavailable numerical data values
Data Mining Functionalities
• Cluster Analysis
– Class label is unknown: group data to form new classes
– Clusters of objects are formed based on the principle of maximizing intra-class similarity and minimizing inter-class similarity
• E.g. Identify homogeneous subpopulations of customers. These clusters may
represent individual target groups for marketing.
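A minimal clustering sketch, assuming scikit-learn is available; the customer attributes (income, spending score) and the choice of k = 3 are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer data: columns = (income, spending score)
X = np.array([[25, 80], [27, 75], [60, 20], [62, 25], [40, 50], [42, 55]])

# Group customers into k homogeneous subpopulations (class labels are unknown)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment per customer
print(km.cluster_centers_)  # centroid of each potential target group
```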
Data Mining Functionalities
• Outlier Analysis
– Data that do not comply with the general behavior or model of the data.
– Outliers are usually discarded as noise or exceptions.
– Useful for fraud detection.
• E.g. Detect purchases of extremely large amounts
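A minimal outlier-detection sketch in Python: hypothetical purchase amounts are flagged when they lie far from the mean, measured in standard deviations.

```python
import numpy as np

# Hypothetical purchase amounts; one extremely large transaction
amounts = np.array([120, 95, 130, 110, 105, 98, 5000], dtype=float)

# Flag values far from the general behavior (here: more than 2 standard deviations)
z = (amounts - amounts.mean()) / amounts.std()
print(amounts[np.abs(z) > 2])   # candidate fraud: [5000.]
```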
• Evolution Analysis
– Describes and models regularities or trends for objects whose
behavior changes over time.
• E.g. Identify stock evolution regularities for overall stocks and for the stocks of
particular companies.
Are All of the Patterns Interesting?
• Data mining may generate thousands of patterns: Not all of them
are interesting
• A pattern is interesting if it is
– easily understood by humans,
– valid on new or test data with some degree of certainty,
– potentially useful,
– novel, or
– able to validate some hypothesis that a user seeks to confirm
• Subjective measures
– Reflect the needs and interests of a particular user.
• E.g. A marketing manager is only interested in characteristics of customers who shop
frequently.
[Figure: data mining as a confluence of multiple disciplines: database systems, statistics, machine learning, visualization, algorithms, and other disciplines.]
Classification of data mining systems
• Database
– Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial,
time-series, text, multi-media, heterogeneous, legacy, WWW
• Knowledge
– Characterization, discrimination, association, classification, clustering, trend/deviation,
outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
Data Mining Task Primitives
(1) The set of task-relevant data – which portion of the database is to be used
– Database or data warehouse name
(2) The kind of knowledge to be mined, e.g.
– Characterization
– Discrimination
– Association
– Classification/prediction
– Clustering
– Outlier analysis
(3) The background knowledge to be used in the discovery process
(4) Interestingness measures and thresholds for pattern evaluation
(5) Visualization methods – what form to display the results, e.g. rules, tables, charts
• General functionality
– Descriptive data mining
– Predictive data mining
• Different views, different classifications
– Kinds of data to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted
Multi-Dimensional View of Data Mining
• Data to be mined
– Relational, data warehouse, transactional, stream, object-
oriented/relational, active, spatial, time-series, text, multi-media,
heterogeneous, legacy, WWW
• Knowledge to be mined
– Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, Web mining, etc.
Major Issues in Data Mining
• Mining methodology and User interaction
– Mining different kinds of knowledge
• DM should cover a wide spectrum of data analysis and knowledge discovery tasks
• Enable users to use the database in different ways
• Require the development of numerous data mining techniques
– Interactive mining of knowledge at multiple levels of abstraction
• Difficult to know exactly what will be discovered
• Allow users to focus the search, refine data mining requests
– Incorporation of background knowledge
• Guide the discovery process
• Allow discovered patterns to be expressed in concise terms and different levels of
abstraction
– Data mining query languages and ad hoc data mining
• High-level query languages need to be developed
• Should be integrated with a DB/DW query language
Major Issues in Data Mining
– Presentation and visualization of results
• Knowledge should be easily understood and directly usable
• High level languages, visual representations or other expressive forms
• Require the DM system to adopt the above techniques
– Data Quality
• Data Cleaning
• Data Integration
• Data Reduction
Data Quality: Why Preprocess the Data?
– Real-world data are often noisy, incomplete and inconsistent; improving quality involves data cleaning, data integration and data reduction
How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human (e.g.,
deal with possible outliers)
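A short sketch of smoothing by bin means and by bin boundaries, using illustrative sorted values partitioned into three equal-frequency bins.

```python
import numpy as np

# Sorted data partitioned into equal-frequency bins of size 3 (illustrative values)
data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)
bins = data.reshape(3, 3)

# Smoothing by bin means: every value is replaced by its bin's mean
means = bins.mean(axis=1, keepdims=True)
print(np.broadcast_to(means, bins.shape))
# [[ 9.  9.  9.] [22. 22. 22.] [29. 29. 29.]]

# Smoothing by bin boundaries: each value snaps to the closer bin edge
lo, hi = bins.min(axis=1, keepdims=True), bins.max(axis=1, keepdims=True)
print(np.where(bins - lo < hi - bins, lo, hi))
```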
Data Cleaning as a Process
• Data discrepancy detection
– Use metadata (e.g., domain, range, dependency, distribution)
– Check field overloading
– Check uniqueness rule, consecutive rule and null rule
– Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code,
spell-check) to detect errors and make corrections
• Data auditing: analyzing data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
– Data migration tools: allow transformations to be specified
– ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
• Integration of the two processes
– Iterative and interactive (e.g., Potter's Wheel)
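As a sketch, the uniqueness, null and domain checks above can be expressed with pandas; the customer table and its rules here are hypothetical.

```python
import pandas as pd

# Hypothetical customer table used to illustrate rule-based discrepancy detection
df = pd.DataFrame({"cust_id": [1, 2, 2, 4, None],
                   "postal_code": ["60614", "606I4", "10001", "94110", "30301"]})

# Uniqueness rule: each cust_id may appear only once
print(df[df["cust_id"].duplicated(keep=False)])

# Null rule: flag records violating a not-null constraint
print(df[df["cust_id"].isna()])

# Domain check: postal codes must be 5 digits (simple scrubbing via a regex)
print(df[~df["postal_code"].str.fullmatch(r"\d{5}")])
```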
Correlation (viewed as linear relationship)
• Correlation measures the linear relationship between objects
• To compute correlation, we standardize data objects A and B, and then take their dot product:
$a'_k = (a_k - \mathrm{mean}(A)) / \mathrm{std}(A)$, and similarly for $b'_k$
• Correlation coefficient:
$r_{A,B} = \frac{\sum_{k=1}^{n} a'_k\, b'_k}{n-1} = \frac{\sum_{k=1}^{n} (a_k - \bar{A})(b_k - \bar{B})}{(n-1)\,\sigma_A\,\sigma_B}$
[Figure: scatter plots showing similarity (correlation) ranging from -1 to 1.]
• Suppose two stocks A and B have the following values in one week: (2,
5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
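A quick check of the question with NumPy, using the correlation formula above on the given weekly prices:

```python
import numpy as np

# Weekly prices of the two stocks from the example above
A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

# r(A,B) = sum((a_k - mean(A)) * (b_k - mean(B))) / ((n-1) * std(A) * std(B))
n = len(A)
r = ((A - A.mean()) @ (B - B.mean())) / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))
print(round(r, 3))              # 0.941 -> strongly positive
print(np.corrcoef(A, B)[0, 1])  # same result via NumPy's built-in
```

Since r is about 0.94 > 0, the two stocks are positively correlated, so their prices tend to rise and fall together.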
Wavelet Transformation
[Figure: hierarchical Haar wavelet decomposition of the original frequency distribution S = (2, 2, 0, 2, 3, 5, 4, 4), with + and - coefficient signs at each resolution level.]
Why Wavelet Transform?
• Use hat-shape filters
– Emphasize region where points cluster
– Suppress weaker information in their boundaries
• Effective removal of outliers
– Insensitive to noise, insensitive to input order
• Multi-resolution
– Detect arbitrary shaped clusters at different scales
• Efficient
– Complexity O(N)
• Only applicable to low dimensional data
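A minimal sketch of the Haar wavelet transform (pairwise averages and differences), applied to the distribution from the figure above; it assumes the input length is a power of two.

```python
import numpy as np

def haar_dwt(s):
    """Haar wavelet transform via pairwise averages (smooth) and pairwise
    differences (detail). Assumes len(s) is a power of two."""
    s = np.asarray(s, dtype=float)
    details = []
    while len(s) > 1:
        details.append((s[0::2] - s[1::2]) / 2)   # detail coefficients
        s = (s[0::2] + s[1::2]) / 2               # smoothed, lower resolution
    return s[0], details[::-1]                    # overall average, coarse-to-fine

# The frequency distribution from the figure above
avg, details = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])
print(avg)       # 2.75 (overall average)
print(details)   # detail coefficients at each resolution level
```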
Principal Component Analysis (PCA)
• Find a projection that captures the largest amount of variation in data
• The original data are projected onto a much smaller space, resulting
in dimensionality reduction. We find the eigenvectors of the
covariance matrix, and these eigenvectors define the new space
[Figure: data points in the (x1, x2) plane with the principal component directions onto which they are projected.]
Principal Component Analysis (Steps)
• Given N data vectors in n dimensions, find k ≤ n orthogonal vectors
(principal components) that can best be used to represent the data
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal
component vectors
– The principal components are sorted in order of decreasing
“significance” or strength
– Since the components are sorted, the size of the data can be reduced
by eliminating the weak components, i.e., those with low variance
(i.e., using the strongest principal components, it is possible to
reconstruct a good approximation of the original data)
• Works for numeric data only
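A minimal NumPy sketch of these steps on hypothetical 2-D data, keeping k = 1 component.

```python
import numpy as np

# Hypothetical 2-D numeric data (rows = data vectors)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

# 1. Normalize input data (center each attribute)
Xc = X - X.mean(axis=0)

# 2. Eigenvectors of the covariance matrix define the new space
vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))

# 3. Sort components by decreasing "significance" (variance) and keep k = 1
order = np.argsort(vals)[::-1]
pc = vecs[:, order[:1]]

# 4. Project: the original data mapped onto a much smaller space
reduced = Xc @ pc
print(reduced.ravel())
```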
Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
– Duplicate much or all of the information contained in
one or more other attributes
– E.g., purchase price of a product and the amount of
sales tax paid
• Irrelevant attributes
– Contain no information that is useful for the data
mining task at hand
– E.g., students' ID is often irrelevant to the task of
predicting students' GPA
Heuristic Search in Attribute Selection
• There are 2^d possible attribute combinations of d attributes
• Typical heuristic attribute selection methods:
– Best single attribute under the attribute independence
assumption: choose by significance tests
– Best step-wise feature selection:
• The best single attribute is picked first
• Then the next best attribute conditioned on the first, ...
– Step-wise attribute elimination:
• Repeatedly eliminate the worst attribute
– Best combined attribute selection and elimination
– Optimal branch and bound:
• Use attribute elimination and backtracking
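A sketch of best step-wise (forward) feature selection. As a stand-in for the significance tests mentioned above, this example scores a candidate attribute set by the R² of a least-squares fit; that scoring choice is an assumption of the sketch.

```python
import numpy as np

def forward_selection(X, y, k):
    """Greedy step-wise selection: pick the best single attribute first,
    then the next best conditioned on those already chosen."""
    chosen = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in chosen:
                continue
            A = np.column_stack([X[:, chosen + [j]], np.ones(len(y))])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ coef
            score = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
            if score > best_score:
                best, best_score = j, score
        chosen.append(best)
    return chosen

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # hypothetical attributes
y = 3 * X[:, 2] + 0.5 * X[:, 4] + rng.normal(scale=0.1, size=100)
print(forward_selection(X, y, 2))             # expected: [2, 4]
```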
Attribute Creation (Feature Generation)
• Create new attributes (features) that can capture the
important information in a data set more effectively than
the original ones
• Three general methodologies
– Attribute extraction
• Domain-specific
– Mapping data to new space (see: data reduction)
• E.g., Fourier transformation, wavelet
transformation, manifold approaches
– Attribute construction
• Combining features
• Data discretization
Data Reduction 2: Numerosity Reduction
• Reduce data volume by choosing alternative, smaller
forms of data representation
• Parametric methods (e.g., regression)
– Assume the data fits some model, estimate model
parameters, store only the parameters, and discard
the data (except possible outliers)
– Ex.: Log-linear models—obtain the value at a point in m-D space as the product of values on appropriate marginal subspaces
• Non-parametric methods
– Do not assume models
– Major families: histograms, clustering, sampling, …
Parametric Data Reduction: Regression
and Log-Linear Models
• Linear regression
– Data modeled to fit a straight line
– Often uses the least-square method to fit the line
• Multiple regression
– Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
• Log-linear model
– Approximates discrete multidimensional probability
distributions
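A small sketch of both ideas with NumPy: a least-square straight-line fit, and a multiple regression of Y on a feature vector; the data values are hypothetical.

```python
import numpy as np

# Hypothetical data roughly on a straight line y = w*x + b plus noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])

# Linear regression: least-square fit of a straight line (degree-1 polynomial)
w, b = np.polyfit(x, y, deg=1)
print(w, b)                      # slope ~2, intercept ~0

# Multiple regression: Y modeled as a linear function of a feature vector
X = np.column_stack([x, x**2, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)
```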
Regression Analysis
[Figure: a scatter of data points with a straight regression line fitted through them.]
Sampling: With or Without Replacement
[Figure: SRSWOR (simple random sampling without replacement) and SRSWR (simple random sampling with replacement) drawn from the raw data.]
Sampling: Cluster or Stratified Sampling
[Figure: the original data and a smaller, approximated (lossy) representation of it.]
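A sketch of the three sampling schemes with NumPy; the raw data and the stratum boundaries are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
raw = np.arange(100)                      # hypothetical raw data

# SRSWOR: simple random sampling without replacement
srswor = rng.choice(raw, size=8, replace=False)

# SRSWR: simple random sampling with replacement
srswr = rng.choice(raw, size=8, replace=True)

# Stratified sampling: draw proportionally from each stratum (here: two halves)
strata = [raw[:50], raw[50:]]
stratified = np.concatenate([rng.choice(s, size=4, replace=False) for s in strata])
print(srswor, srswr, stratified, sep="\n")
```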
Data Transformation: Normalization
• Normalization by z-score:
$v' = \frac{v - \mu_A}{\sigma_A}$
– Ex. Let μ = 54,000 and σ = 16,000. Then v = 73,600 is normalized to (73,600 - 54,000) / 16,000 = 1.225
• Normalization by decimal scaling:
$v' = \frac{v}{10^j}$, where j is the smallest integer such that Max(|v'|) < 1
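A short sketch of both normalizations, reusing the example's μ = 54,000 and σ = 16,000 on illustrative values.

```python
import numpy as np

v = np.array([73_600, 54_000, 16_000, 98_000], dtype=float)  # illustrative values

# z-score normalization with the example's parameters
mu, sigma = 54_000, 16_000
print((v - mu) / sigma)          # 73,600 -> 1.225, as in the example

# Decimal scaling: v' = v / 10^j, smallest j with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
print(v / 10**j)                 # here j = 5, so 98,000 -> 0.98
```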
Discretization
• Three types of attributes
– Nominal—values from an unordered set, e.g., color, profession
– Ordinal—values from an ordered set, e.g., military or academic rank
– Numeric—quantitative values, e.g., integer or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Reduce data size by discretization
– Supervised vs. unsupervised
– Split (top-down) vs. merge (bottom-up)
– Discretization can be performed recursively on an attribute
– Prepare for further analysis, e.g., classification
Data Discretization Methods
• Typical methods: All the methods can be applied recursively
– Binning
• Top-down split, unsupervised
– Histogram analysis
• Top-down split, unsupervised
– Clustering analysis (unsupervised, top-down split or
bottom-up merge)
– Decision-tree analysis (supervised, top-down split)
– Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)
Simple Discretization: Binning
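A brief NumPy sketch contrasting equal-width and equal-frequency binning on illustrative sorted values.

```python
import numpy as np

data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)

# Equal-width binning: split the value range into 3 intervals of equal size
edges = np.linspace(data.min(), data.max(), num=4)
print(np.digitize(data, edges[1:-1]))   # interval label per value

# Equal-frequency binning: each bin receives (roughly) the same number of values
labels = np.searchsorted(np.quantile(data, [1/3, 2/3]), data)
print(labels)
```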