Unit - I
Data mining is one of the most useful techniques for helping entrepreneurs, researchers,
and individuals extract valuable information from huge sets of data.
Data mining is also called Knowledge Discovery in Databases (KDD), although, strictly
speaking, data mining is one step in the KDD process.
The knowledge discovery process includes Data cleaning, Data integration, Data
selection, Data transformation, Data mining, Pattern evaluation, and Knowledge
presentation.
WHAT IS DATA MINING?
Data mining is the process of extracting information from huge sets of data to identify
patterns, trends, and useful data that allow a business to make data-driven decisions.
Data mining is the act of automatically searching large stores of information to find
trends and patterns that go beyond simple analysis procedures.
Data mining uses complex mathematical algorithms to segment the data and evaluate
the probability of future events.
Data Mining is the mining, or discovery, of new information in terms of patterns or
rules from vast amounts of data.
To be useful, data mining must be carried out efficiently on large files and databases.
Data Mining is a process used by organizations to extract specific data from huge
databases to solve business problems.
It primarily turns raw data into useful information.
DATA MINING IN BUSINESS INTELLIGENCE
[Figure: pyramid of business intelligence layers. The potential to support business
decisions increases from the bottom layer to the top. Recoverable layers, bottom to top:
Data Sources (paper, files, Web documents, scientific experiments, database systems);
Data Preprocessing/Integration, Data Warehouses (DBA);
Data Exploration (statistical summary, querying, and reporting);
Decision Making (End User).]
Business intelligence (BI) can be described as
"a set of techniques and tools for the acquisition and transformation
of raw data into meaningful and useful information for business analysis
purposes."
GOALS OF DATA MINING
Prediction: Determine how certain attributes will behave in the future. For example,
how much sales volume a store will generate in a given period.
Identification: Identify patterns in data. For example, newlywed couples tend to
spend more money buying furniture.
Classification: Partition data into classes. For example, customers can be classified
into different categories with different behavior in shopping.
Optimization: Optimize the use of limited resources such as time, space, money or
materials. For example, how to best use advertising to maximize profits (sales).
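As a rough illustration of the prediction goal, the sketch below fits a straight-line trend to a store's monthly sales by ordinary least squares and extrapolates one period ahead. The sales figures and the helper name `fit_line` are made up for illustration; a real forecast would also have to deal with seasonality and noise.

```python
# Hypothetical monthly sales volumes (units sold); not real data.
sales = [100.0, 110.0, 120.0, 130.0, 140.0]
months = list(range(1, len(sales) + 1))  # 1, 2, ..., 5

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(months, sales)
next_month = len(sales) + 1
forecast = slope * next_month + intercept
print(f"Forecast for month {next_month}: {forecast:.1f}")
```

With this toy data the trend is exactly linear (10 units per month), so the forecast for month 6 is 150.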
TYPES OF KNOWLEDGE DISCOVERED DURING DATA MINING
Association rules: For example, when a male shopper buys a new car, he is likely to
buy a car CD.
Classification hierarchies: For example, mutual funds may be classified into three
categories: growth, income and stable.
Sequence patterns: Sequence patterns are temporal associations. For example, if
the mortgage interest rate drops, then within a six-month period the sales of houses
will increase by a certain percentage.
Patterns within time series: such as stock price data behavior in time.
Detection of Similarity, or segmentation: For example, health data may indicate
similarity among subgroups of people.
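To make the association-rule example concrete, the sketch below computes support and confidence for the rule "car -> car CD" over a handful of transactions. Support and confidence are standard measures for association rules, but they are not named in the text above; the transactions, item names, and function names are all hypothetical.

```python
# Hypothetical market-basket transactions; item names are illustrative only.
transactions = [
    {"car", "car_cd"},
    {"car", "car_cd", "floor_mats"},
    {"car"},
    {"floor_mats"},
    {"car", "car_cd"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and B) / support(A)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"car", "car_cd"}, transactions)
rule_confidence = confidence({"car"}, {"car_cd"}, transactions)
print(f"car -> car_cd: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here 3 of 5 transactions contain both items (support 0.60), and 3 of the 4 transactions containing a car also contain a car CD (confidence 0.75).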
STEPS IN DATA MINING PROCESS
Data coming from a variety of sources is integrated into a single data store
called the target data.
The data is then pre-processed and transformed into a standard format.
The data mining algorithms process the data and produce output in the form of
patterns or rules.
Those patterns and rules are then interpreted to yield new or useful knowledge
or information.
Stages of Data Mining
1. Data cleaning
• to remove noise and inconsistent data
2. Data integration
• where multiple data sources may be combined
3. Data selection
• where data relevant to the analysis task are retrieved from
the database
4. Data transformation
• where data are transformed or consolidated into forms
appropriate for mining by performing summary or
aggregation operations, for instance
5. Data mining
• an essential process where intelligent methods are applied
in order to extract data patterns
6. Pattern evaluation
• to identify the truly interesting patterns representing
knowledge based on some interestingness measures
7. Knowledge presentation
• where visualization and knowledge representation
techniques are used to present the mined knowledge to
the user
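The seven stages can be sketched end to end on toy data. This is only a minimal illustration under stated assumptions: the records, the field names, the `MIN_SPEND` threshold, and the trivially simple "mining" step are all hypothetical and stand in for far more elaborate processing in a real system.

```python
from collections import defaultdict

# Hypothetical records from two sources; None marks a missing value.
store_sales = [
    {"customer": "A", "item": "sofa", "amount": 500.0},
    {"customer": "B", "item": "lamp", "amount": None},   # noisy record
    {"customer": "A", "item": "table", "amount": 300.0},
]
web_sales = [
    {"customer": "B", "item": "chair", "amount": 120.0},
    {"customer": "C", "item": "sofa", "amount": 450.0},
]

# 1. Data cleaning: drop records with missing amounts.
def clean(records):
    return [r for r in records if r["amount"] is not None]

# 2. Data integration: combine the cleaned sources.
integrated = clean(store_sales) + clean(web_sales)

# 3. Data selection: keep only the fields relevant to the task.
selected = [(r["customer"], r["amount"]) for r in integrated]

# 4. Data transformation: aggregate spending per customer.
totals = defaultdict(float)
for customer, amount in selected:
    totals[customer] += amount

# 5. Data mining: a deliberately trivial "pattern", customers ranked by spend.
patterns = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# 6. Pattern evaluation: keep patterns above an assumed interestingness threshold.
MIN_SPEND = 400.0
interesting = [(c, total) for c, total in patterns if total >= MIN_SPEND]

# 7. Knowledge presentation: report the result to the user.
for customer, total in interesting:
    print(f"High-value customer {customer}: {total:.2f}")
```

Each stage consumes only the output of the previous one, which mirrors the ordering of the list above.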
DATA MINING ARCHITECTURE
Knowledge base:
This is the domain knowledge that is used to guide the search or evaluate the interestingness of
resulting patterns.
Such knowledge can include concept hierarchies, used to organize attributes or attribute values into
different levels of abstraction.
Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness based on its
unexpectedness, may also be included.
Other examples of domain knowledge are additional interestingness constraints or thresholds, and
metadata (e.g., describing data from multiple heterogeneous sources).
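A concept hierarchy can be pictured as a mapping from low-level attribute values to higher-level concepts. The sketch below rolls customer cities up to countries; the mapping, the city list, and the name `roll_up` are hypothetical and serve only to show one level of abstraction.

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> country (one level of abstraction up).
CITY_TO_COUNTRY = {
    "Chennai": "India",
    "Mumbai": "India",
    "Toronto": "Canada",
    "Vancouver": "Canada",
}

def roll_up(values, hierarchy):
    """Replace each value with its parent concept; unknown values map to 'Unknown'."""
    return [hierarchy.get(v, "Unknown") for v in values]

customer_cities = ["Chennai", "Toronto", "Mumbai", "Chennai", "Vancouver"]
by_country = Counter(roll_up(customer_cities, CITY_TO_COUNTRY))
print(by_country)
```

Patterns mined at the country level are coarser than those mined at the city level, and the hierarchy lets the system move between the two.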
Data mining engine:
This is essential to the data mining system and ideally consists of a set of functional
modules for tasks such as
characterization,
association and correlation analysis,
classification,
prediction,
cluster analysis,
outlier analysis, and
evolution analysis.
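As one example of such a functional module, the sketch below performs a very simple form of outlier analysis using z-scores. The threshold of 2.0, the sample data, and the function name `find_outliers` are assumptions made for illustration; the z-score rule is only one of many outlier detection techniques and is not prescribed by the text.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily transaction counts with one unusual day.
daily_transactions = [52, 48, 50, 51, 49, 50, 47, 53, 50, 120]
print(find_outliers(daily_transactions))
```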
Pattern evaluation module:
This component typically employs interestingness measures and interacts with the
data mining modules so as to focus the search toward interesting patterns.
It may use interestingness thresholds to filter out discovered patterns.
Alternatively, the pattern evaluation module may be integrated with the mining
module, depending on the implementation of the data mining method used.
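Threshold-based filtering of discovered patterns might look like the sketch below. The candidate rules, the choice of support and confidence as interestingness measures, and the threshold values are all assumptions for illustration; a real system may use other measures.

```python
# Hypothetical candidate rules produced by a mining step.
candidate_rules = [
    {"rule": "car -> car_cd", "support": 0.60, "confidence": 0.75},
    {"rule": "sofa -> lamp", "support": 0.05, "confidence": 0.90},
    {"rule": "bread -> milk", "support": 0.40, "confidence": 0.30},
]

# Assumed interestingness thresholds.
MIN_SUPPORT = 0.10
MIN_CONFIDENCE = 0.50

def is_interesting(rule, min_support=MIN_SUPPORT, min_confidence=MIN_CONFIDENCE):
    """A rule passes only if it meets both thresholds."""
    return rule["support"] >= min_support and rule["confidence"] >= min_confidence

interesting_rules = [r["rule"] for r in candidate_rules if is_interesting(r)]
print(interesting_rules)
```

Only the first rule survives: the second is too rare and the third is too unreliable under these thresholds.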
User interface:
This module communicates between users and the data mining system.
It allows the user to interact with the system by specifying a data mining query or
task, providing information to help focus the search, and performing exploratory data
mining based on the intermediate data mining results.
In addition, this component allows the user to browse database and data warehouse
schemas or data structures, evaluate mined patterns, and visualize the patterns in
different forms.
MULTI-DIMENSIONAL VIEW OF DATA MINING
Data to be mined
• Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse,
transactional data, stream, spatiotemporal, time-series, sequence, text and web, multimedia, graphs,
and social and information networks
Knowledge to be mined (or: data mining functions)
• Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis,
etc.
• Descriptive vs. predictive data mining
• Multiple/integrated functions and mining at multiple levels
Techniques utilized
• Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization,
high-performance, etc.
Applications adapted
• Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining,
Web mining, etc.
DATA MINING: ON WHAT KINDS OF DATA?
Mining Methodology
• Mining various and new kinds of knowledge
• Mining knowledge in multi-dimensional space
• Data mining: An interdisciplinary effort
• Boosting the power of discovery in a networked environment
• Handling noise, uncertainty, and incompleteness of data
• Pattern evaluation and pattern- or constraint-guided mining
User Interaction
• Interactive mining
• Incorporation of background knowledge
• Presentation and visualization of data mining results