ACKNOWLEDGEMENT
We wish to express our deepest gratitude to Mr. Jarnail Singh for his utmost
concern, guidance, and encouragement during our major project. We thank him
for his pleasant and insightful interactions with us, for his constant guidance
and encouragement, and for his help in gaining access to the resources essential
for the successful completion of this assignment. He shared his valuable
knowledge and resources with us and showed the utmost co-operation and
understanding.
ANKITA AGARWAL
DEEPIKA RAIPURIA
MITS, LAXMANGARH
EXECUTIVE SUMMARY
Data mining is the process of posing various queries and extracting useful information,
patterns, and trends, often previously unknown, from large quantities of data, possibly
stored in databases. The goals of data mining include detecting abnormal patterns and
predicting the future based on past experiences and current trends. Data mining
techniques include those based on rough sets, inductive logic programming,
machine learning, and neural networks.
The knowledge discovery process consists of an iterative sequence of steps:
Data cleaning
Data integration
Data selection
Data transformation
Data mining
Pattern evaluation
Knowledge presentation
Data can be stored in many types of databases. One database architecture is the data
warehouse, a repository of multiple, heterogeneous data sources. Data warehouse
technology includes data cleansing, data integration, and on-line analytical processing (OLAP).
[Fig. 1.1: data mining as a step in knowledge discovery: data in databases are cleaned and integrated into a warehouse, from which data selection yields the task-relevant data.]
The architecture of a typical data mining system has the following major
components, as shown in Fig. 1.2: a database, data warehouse, or other information
repository; a database or data warehouse server; a knowledge base; a data mining
engine; a pattern evaluation module; and a graphical user interface.
[Fig. 1.2: architecture of a typical data mining system.]
Data mining can be performed on many kinds of data repositories, including:
Object-oriented databases
Object-relational databases
Spatial databases
Temporal databases and time-series databases
Text databases and multimedia databases
Heterogeneous databases and legacy databases
The World Wide Web
Data mining functionalities are used to specify the kinds of patterns to be found in data
mining tasks. They can be classified into two categories: descriptive and predictive.
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make
predictions.
Data can be associated with classes or concepts. It can be useful to describe individual
classes or concepts in summarized, concise, and yet precise terms. Such descriptions of a
class or a concept are called class/concept descriptions. These descriptions can be
derived via:
Data characterization: a summarization of the general characteristics or features of a
target class of data. The data corresponding to the user-specified class are typically
collected by a database query.
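As a minimal sketch of this collection step (the sales table, its columns, and its rows are hypothetical, not taken from the report), Python's built-in sqlite3 module can gather the tuples of a user-specified target class with a query and then summarize one of their general features:

import sqlite3

# Hypothetical database: a "sales" table with columns (region, amount).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("north", 80.0), ("south", 200.0)])

# Collect the data of the user-specified target class ("north" sales)...
rows = conn.execute("SELECT amount FROM sales WHERE region = ?",
                    ("north",)).fetchall()

# ...and characterize it with simple summary statistics.
amounts = [amount for (amount,) in rows]
print("count:", len(amounts), "average amount:", sum(amounts) / len(amounts))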
More formally, association rules are of the form X => Y, that is,
"A1 ^ ... ^ Am => B1 ^ ... ^ Bn", where Ai (for i in {1, ..., m}) and Bj (for j in {1, ..., n})
are attribute-value pairs. The association rule X => Y is interpreted as "database tuples
that satisfy the conditions in X are also likely to satisfy the conditions in Y."
Classification is the process of finding a set of models that describe and distinguish data
classes or concepts, for the purpose of being able to use the model to predict the class of
objects whose class label is unknown. The derived model is based on the analysis of a set
of training data, that is, data objects whose class label is known. Classification can be
used for predicting the class label of data objects. Prediction refers to both data value
prediction and class label prediction. It also encompasses the identification of
distribution trends based on the available data.
Unlike classification and prediction, clustering analyzes data objects without consulting a
known class label; it can be used to generate such labels. The objects are clustered or
grouped based on the principle of maximizing the intraclass similarity and minimizing
the interclass similarity. That is, clusters of objects are formed so that objects within a
cluster have high similarity in comparison to one another, but are very dissimilar to
objects in other clusters. Each cluster that is formed can be viewed as a class of objects,
from which rules can be derived.
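As one concrete illustration of this grouping principle, here is a minimal k-means sketch in Python (k-means is just one common clustering method, and the one-dimensional points and initial centers are hypothetical): points are assigned to the nearest center, and each center is then moved to the mean of its cluster:

def kmeans(points, centers, iterations=10):
    # Iteratively refine the centers so that intraclass similarity grows.
    for _ in range(iterations):
        # Assignment step: place each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 10.1], centers=[0.0, 5.0])
print(centers)   # two centers, near 1.0 and 9.5
print(clusters)  # points grouped so that within-cluster similarity is high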
A database may contain data objects that do not comply with the general behavior or the
model of the data. These data objects are outliers. Most data mining methods discard
outliers as noise or exceptions. However, in some applications such as fraud detection,
the rare events can be more interesting than the more regularly occurring ones. The
analysis of outlier data is referred to as outlier mining.
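A minimal sketch of one simple statistical approach to outlier mining (the z-score threshold and the transaction amounts below are hypothetical):

import statistics

def zscore_outliers(values, threshold=2.0):
    # Flag values lying more than `threshold` standard deviations from the mean.
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical transaction amounts: the rare large value is the interesting one.
print(zscore_outliers([12, 15, 14, 13, 16, 15, 500]))  # -> [500]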
1.7 Are all of the patterns interesting?
A data mining system has the potential to generate thousands or even millions of patterns,
or rules.
Several objective measures of pattern interestingness exist. These are based on the
structure of discovered patterns and the statistics underlying them. An objective
measure for association rules of the form X=>Y is rule support, representing the
percentage of transactions from a transaction database that the given rule satisfies.
This is taken to be the probability P(X ∪ Y), where X ∪ Y indicates that a transaction
contains both X and Y, that is, the union of itemsets X and Y.
Another objective measure for association rules is confidence, which assesses the
degree of certainty of the detected association. This is taken to be the conditional
probability P(Y|X), that is, the probability that a transaction containing X also
contains Y.
Support and confidence are defined as:
support(X => Y) = P(X ∪ Y)
confidence(X => Y) = P(Y|X)
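For example, with purely hypothetical numbers: if a transaction database holds 10,000 transactions, of which 2,000 contain both X and Y and 4,000 contain X, then support(X => Y) = 2,000/10,000 = 20% and confidence(X => Y) = 2,000/4,000 = 50%.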
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management's decision-making process.
The four keywords, subject-oriented, integrated, time-variant, and nonvolatile,
distinguish data warehouses from other data repository systems, such as relational
database systems, transaction processing systems, and file systems.
The major task of on-line operational database systems is to perform on-line transaction
and query processing. These systems are called on-line transaction processing (OLTP)
systems. They cover most of the day-to-day operations of an organization, such as
purchasing, inventory, manufacturing, banking, payroll, registration, and accounting.
Data warehouse systems, on the other hand, serve users or knowledge workers in the role
of data analysis and decision making. Such systems can organize and present data in
various formats in order to accommodate the diverse needs of different users. These
systems are known as on-line analytical processing (OLAP) systems. OLTP and OLAP
systems are distinguished mainly by their users and orientation (clerks and day-to-day
operations versus knowledge workers and analysis), their data contents (current, detailed
data versus historical, summarized data), and their access patterns (short update
transactions versus complex, mostly read-only queries).
Many of the characteristics of OLAP systems, such as the use of a multidimensional data
model and concept hierarchies, the association of measures with dimensions, and the
notions of roll-up and drill-down, also exist in earlier work on statistical databases (SDBs).
A statistical database is a database system that is designed to support statistical
applications. Similarities between the two types of systems are rarely discussed, mainly
due to differences in terminology and application domains.
OLAP and SDB systems, however, have distinguishing differences. While SDBs tend to
focus on socio-economic applications, OLAP has been targeted for business applications.
Privacy issues regarding concept hierarchies are a major concern for SDBs. For example,
given summarized socio-economic data, it is controversial to allow users to view the
corresponding low-level data. Finally, unlike SDBs, OLAP systems are designed for
handling huge amounts of data efficiently.
1. The bottom tier is a warehouse database server that is almost always a relational
database system. Data from operational databases and external sources are
extracted using application program interfaces known as gateways. A gateway is
supported by the underlying DBMS and allows client programs to generate SQL code
to be executed at a server.
2. The middle tier is an OLAP server that is typically implemented using either (1) a
relational OLAP (ROLAP) model, that is, an extended relational DBMS that
maps operations on multidimensional data to standard relational operations; or (2)
a multidimensional OLAP (MOLAP) model, that is, a special purpose server
that directly implements multidimensional data and operations.
3. The top tier is a client, which contains query and reporting tools, analysis tools,
and/or data mining tools.
[Fig. 2.1: multi-tiered data warehouse architecture. Data from operational DBs and other sources are extracted, transformed, loaded, and refreshed through a monitor/integrator into the data warehouse and data marts, described by metadata; an OLAP server then serves query/report, analysis, and data mining tools.]
From the architectural point of view, there are three data warehouse models:
Enterprise warehouse: an enterprise warehouse collects all of the information
about subjects spanning the entire organization. It provides corporate-wide data
integration, usually from one or more operational systems or external information
providers, and is cross functional in scope.
Data mart: a data mart contains a subset of corporate-wide data that is of value
to a specific group of users. Its scope is confined to specific selected subjects.
Data marts are usually implemented on low-cost departmental servers that are
UNIX or Windows/NT based.
Virtual warehouse: a virtual warehouse is a set of views over operational
databases, of which only some of the possible summary views may be materialized.
[Fig. 2.2: a multi-tier data warehouse with an enterprise data warehouse and distributed data marts.]
2.4 Metadata Repository
Metadata are data about data. When used in a data warehouse, metadata are the data that
define warehouse objects. They are created for the data names and definitions of the
given warehouse. Additional metadata are created and captured for time-stamping any
extracted data, the source of the extracted data, and missing fields that have been added
by data cleaning or integration processes. A metadata repository typically contains a
description of the structure of the data warehouse, operational metadata (such as data
lineage and currency), the algorithms used for summarization, mappings from the
operational environment to the warehouse, data related to system performance, and
business metadata.
Among the many different paradigms and architectures of data mining systems, on-line
analytical mining (OLAM), which integrates on-line analytical processing (OLAP) with
data mining in multidimensional databases, is particularly important because data
warehouses provide high-quality, cleaned, and integrated data; a surrounding information
processing infrastructure is already in place; OLAP provides an exploratory environment
for data analysis; and data mining functions can be selected and switched on-line.
[Fig. 2.3: an OLAM architecture, in which OLAM and OLAP engines operate over multidimensional databases (MDDBs) guided by metadata.]
An OLAM server performs analytical mining in data cubes in a similar manner as an
OLAP server performs on-line analytical processing. An integrated OLAM and OLAP
architecture is shown in Fig. 2.3, where the OLAM and OLAP servers both accept user
on-line queries via a graphical user interface API and work with the data cube in the data
analysis via a cube API. A metadata directory is used to guide the access of the data
cube.
The data cube can be constructed by accessing and/or integrating multiple databases via
an MDDB API and/or by filtering a data warehouse via a database API that may support
OLEDB or ODBC connections.
Association rule mining searches for interesting relationships among items in a given data
set. The basic concepts of mining associations are as follows:
Let J = {i1, i2, ..., im} be a set of items. Let D, the task-relevant data, be a set of
database transactions, where each transaction T is a set of items such that T is a subset of
J. Each transaction is associated with an identifier, called a TID. Let A be a set of items.
A transaction T is said to contain A if and only if A is a subset of T. An association rule
is an implication of the form A => B, where A is a subset of J, B is a subset of J, and
A ∩ B = ∅. The rule A => B holds in the transaction set D with support s, where s is the
percentage of transactions in D that contain A ∪ B. This is taken to be the probability
P(A ∪ B). The rule A => B has confidence c in the transaction set D if c is the
percentage of transactions in D containing A that also contain B. This is taken to be the
conditional probability P(B|A). That is,
support(A => B) = P(A ∪ B)
confidence(A => B) = P(B|A)
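As a minimal Python sketch of these two definitions (the transactions and items below are hypothetical), support and confidence can be computed directly over a small transaction set D:

def support(D, itemset):
    # Fraction of transactions in D that contain every item in `itemset`.
    return sum(1 for T in D if itemset <= T) / len(D)

def confidence(D, A, B):
    # P(B|A): among transactions containing A, the fraction also containing B.
    return support(D, A | B) / support(D, A)

# Hypothetical transactions, each represented as a set of items.
D = [{"bread", "milk"}, {"bread", "butter"},
     {"bread", "milk", "butter"}, {"milk"}]
A, B = {"bread"}, {"milk"}
print(support(D, A | B))    # support(A => B) = P(A ∪ B) = 2/4 = 0.5
print(confidence(D, A, B))  # confidence(A => B) = P(B|A) = 2/3 ≈ 0.67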
Let us look at how Lk-1 is used to find Lk. A two-step process is followed, consisting of
join and prune actions.
1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1
with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk-1. The
notation li[j] refers to the jth item in li. By convention, Apriori assumes that items
within a transaction or itemset are sorted in lexicographic order. The join of Lk-1 with
Lk-1 is performed, where members of Lk-1 are joinable if their first (k-2) items are in
common. That is, members l1 and l2 of Lk-1 are joined if (l1[1] = l2[1]) and
(l1[2] = l2[2]) and ... and (l1[k-2] = l2[k-2]) and (l1[k-1] < l2[k-1]). The condition
l1[k-1] < l2[k-1] simply ensures that no duplicates are generated. The resulting itemset
formed by joining l1 and l2 is l1[1] l1[2] ... l1[k-1] l2[k-1].
2. The prune step: Ck is a superset of Lk; that is, its members may or may not be frequent,
but all of the frequent k-itemsets are included in Ck. A scan of the database to determine
the count of each candidate in Ck would result in the determination of Lk. Ck, however,
can be huge, and so this could involve heavy computation. To reduce the size of Ck, the
Apriori property is used as follows: any (k-1)-itemset that is not frequent cannot be a
subset of a frequent k-itemset. Hence, if any (k-1)-subset of a candidate k-itemset is not in
Lk-1, then the candidate cannot be frequent either and so can be removed from Ck. This
subset testing can be done quickly by maintaining a hash tree of all frequent itemsets.
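The join and prune steps translate into the following minimal Python sketch of Apriori candidate generation (itemsets are represented as lexicographically sorted tuples; the frequent 2-itemsets used to exercise it are hypothetical):

from itertools import combinations

def apriori_gen(L_prev, k):
    # Generate candidate k-itemsets Ck from the frequent (k-1)-itemsets L_prev.
    L_set = set(L_prev)
    candidates = []
    for l1 in L_prev:
        for l2 in L_prev:
            # Join step: merge itemsets whose first k-2 items agree;
            # requiring l1[k-2] < l2[k-2] ensures no duplicates are generated.
            if l1[:k - 2] == l2[:k - 2] and l1[k - 2] < l2[k - 2]:
                c = l1 + (l2[k - 2],)
                # Prune step: every (k-1)-subset of a frequent k-itemset must
                # itself be frequent (the Apriori property).
                if all(s in L_set for s in combinations(c, k - 1)):
                    candidates.append(c)
    return candidates

# Hypothetical frequent 2-itemsets.
L2 = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
print(apriori_gen(L2, 3))  # -> [('A', 'B', 'C')]; ('B', 'C', 'D') is pruned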
In the second step, the model is used for classification. First, the predictive accuracy of
the model is estimated. The holdout method is a simple technique that uses a test set of
class-labeled samples. These samples are randomly selected and are independent of the
training samples. The accuracy of a model on a given test set is the percentage of test set
samples that are correctly classified by the model. For each test sample, the known class
label is compared with the learned model's class prediction for that sample. If the
accuracy of the model were estimated based on the training data set, this estimate could
be optimistic, since the learned model tends to overfit the data. Therefore, a test set is
used.
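A minimal sketch of the holdout estimate in Python (the samples, labels, and the stand-in "learned model" are hypothetical; in practice the model would be trained only on the samples outside the test set):

import random

def holdout_accuracy(samples, labels, predict, test_fraction=0.3, seed=42):
    # Hold out a random test set, independent of the training samples, and
    # report the fraction of test samples the model classifies correctly.
    random.seed(seed)
    indices = list(range(len(samples)))
    random.shuffle(indices)
    test = indices[:int(len(indices) * test_fraction)]
    correct = sum(1 for i in test if predict(samples[i]) == labels[i])
    return correct / len(test)

# Hypothetical one-dimensional data and a stand-in learned model.
samples = [0.2, 0.4, 0.6, 0.9, 1.3, 1.8, 2.2, 2.9, 3.1, 3.8]
labels = ["low"] * 5 + ["high"] * 5
model = lambda x: "low" if x < 1.5 else "high"
print(holdout_accuracy(samples, labels, model))  # fraction correctly classified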
Prediction can be viewed as the construction and use of a model to assess the class of an
unlabeled sample, or to assess the value or value range of an attribute that a given sample
is likely to have.
A decision tree is a flowchart-like tree structure, where each internal node denotes a test
on an attribute, each branch represents an outcome of the test, and leaf nodes represent
classes or class distributions. The topmost node in the tree is the root node.
In order to classify an unknown sample, the attribute-values of the sample are tested
against the decision tree. A path is traced from the root to a leaf node that holds the class
prediction for that sample. Decision trees can easily be converted to classification rules.
Input: the training samples, samples, represented by discrete-valued attributes; the set of
candidate attributes, attribute_list.
Method:
(1) create a node N;
(2) if samples are all of the same class, C, then
(3) return N as a leaf node labeled with the class C;
(4) if attribute_list is empty then
(5) return N as a leaf node labeled with the most common class in samples;
// majority voting
(6) select test_attribute, the attribute among attribute_list with the highest
information gain;
(7) label node N with test_attribute;
(8) for each known value ai of test_attribute // partition the samples
(9) grow a branch from node N for the condition test_attribute = ai;
(10) let si be the set of samples in samples for which test_attribute = ai;
(11) if si is empty then
(12) attach a leaf labeled with the most common class in samples;
(13) else attach the node returned by Generate_decision_tree(si, attribute_list
minus test_attribute);
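The pseudocode translates fairly directly into Python. The following compact ID3-style sketch uses entropy-based information gain in step (6); the training samples are hypothetical:

import math
from collections import Counter

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    return -sum((n / len(rows)) * math.log2(n / len(rows))
                for n in counts.values())

def info_gain(rows, attr, target):
    # Expected reduction in entropy from partitioning the samples on `attr`.
    values = set(r[attr] for r in rows)
    remainder = sum(len(part) / len(rows) * entropy(part, target)
                    for v in values
                    for part in [[r for r in rows if r[attr] == v]])
    return entropy(rows, target) - remainder

def generate_decision_tree(rows, attrs, target):
    classes = [r[target] for r in rows]
    if len(set(classes)) == 1:                 # steps (2)-(3)
        return classes[0]
    if not attrs:                              # steps (4)-(5): majority voting
        return Counter(classes).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, target))  # steps (6)-(7)
    tree = {best: {}}
    for v in set(r[best] for r in rows):       # steps (8)-(13)
        part = [r for r in rows if r[best] == v]
        tree[best][v] = generate_decision_tree(
            part, [a for a in attrs if a != best], target)
    return tree

# Hypothetical training samples.
rows = [{"outlook": "sunny", "windy": "no",  "play": "yes"},
        {"outlook": "sunny", "windy": "yes", "play": "no"},
        {"outlook": "rain",  "windy": "no",  "play": "yes"},
        {"outlook": "rain",  "windy": "yes", "play": "no"}]
print(generate_decision_tree(rows, ["outlook", "windy"], "play"))
# -> {'windy': {'no': 'yes', 'yes': 'no'}}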
REFERENCES
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann.