
DMDW IMP QUES

UNIT 1
2M
1)WHY DATA MINING IS USED
➢ Better Decision-Making: Provides actionable insights for informed choices.
➢ Operational Efficiency: Improves processes and reduces costs.
➢ Risk Management: Detects fraud and predicts potential risks.
➢ Innovation: Generates new ideas and opportunities.

2)DEFINE DATA MINING


Data mining is the process of discovering patterns, correlations, and insights from large datasets
using statistical, mathematical, and computational techniques. It involves analysing data to extract
useful information and transform it into actionable knowledge.

3)GIVE FEW APPLICATIONS OF DATA MINING


➢ Marketing: Customer segmentation and targeted campaigns.
➢ Finance: Fraud detection and credit scoring.
➢ Healthcare: Predicting patient outcomes and optimizing treatments.
➢ Retail: Inventory management and product recommendations.
4)DEFINE BUCKETING
Bucketing is a data pre-processing method used to minimize the effects of small observation errors.

There are 2 methods of dividing data into bins:

• Equal Frequency Binning: bins have an equal frequency.

• Equal Width Binning: bins have equal width; the bin boundaries are defined as [min + w], [min + 2w], ..., [min + nw], where w = (max − min) / (number of bins).
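
A minimal sketch of the two binning methods, assuming pandas is available; the price values are sample data (the same values used in the smoothing problem later in this unit), and pd.cut / pd.qcut are used for equal-width and equal-frequency binning respectively.

```python
import pandas as pd

# Sample price data (same values as the smoothing problem later in this unit)
prices = pd.Series([9, 8, 4, 15, 24, 21, 21, 25, 26, 34, 29, 28])

n_bins = 3

# Equal-width binning: each bin spans w = (max - min) / n_bins
equal_width = pd.cut(prices, bins=n_bins)

# Equal-frequency (equal-depth) binning: each bin holds roughly the same number of values
equal_freq = pd.qcut(prices, q=n_bins)

print(equal_width.value_counts())
print(equal_freq.value_counts())
```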
5)SOLVE PROBLEMS FOR NORMALIZATION
i) Min-max normalization: maps a value v of attribute A to v' in [new_min_A, new_max_A]:

v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to:

((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716

ii) Normalization by decimal scaling:

v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1

iii) Z-score normalization (μ_A: mean of A, σ_A: standard deviation of A):

v' = (v − μ_A) / σ_A

Eg: for a value 3 with mean 21.1 and standard deviation 29.8, v' = (3 − 21.1) / 29.8 ≈ −0.61. Do the same for all values.
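
A minimal sketch of the three normalization methods, assuming NumPy; the values are a small hypothetical income sample that includes the figures from the worked example above.

```python
import numpy as np

# Hypothetical income sample containing the values from the worked example
values = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)

# i) Min-max normalization to [new_min, new_max]
new_min, new_max = 0.0, 1.0
min_max = (values - values.min()) / (values.max() - values.min()) \
          * (new_max - new_min) + new_min        # 73,600 -> ~0.716

# ii) Normalization by decimal scaling: divide by 10^j,
#     j = smallest integer such that max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(values).max() + 1)))
decimal_scaled = values / 10 ** j

# iii) Z-score normalization: (v - mean) / standard deviation
z_scores = (values - values.mean()) / values.std()

print(min_max, decimal_scaled, z_scores, sep="\n")
```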
12M
1)EXPLAIN KNOWLEDGE DISCOVERY PROCESS (KDD)

Data cleaning

To remove noise and inconsistent data

Data integration

Where multiple data sources may be combined

Data selection

Where data relevant to the analysis task are retrieved from the database

Data transformation

Where data are transformed and consolidated into forms appropriate for mining by performing
summary or aggregation operations

Data mining

An essential process where intelligent methods are applied to extract data patterns

Pattern evaluation

To identify the truly interesting patterns representing knowledge based on interestingness measures

Knowledge presentation

Where visualization and knowledge representation techniques are used to present mined knowledge
to users
Data to be mined

Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks

Knowledge to be mined (or: Data mining functions)

Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.

Descriptive vs. predictive data mining

Multiple/integrated functions and mining at multiple levels

Techniques utilized

Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.

Applications adapted

Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text
mining, Web mining, etc.

2)EXPLAIN SYSTEM ARCHITECTURE OF DATA MINING


Components of data mining systems

Data source

➢ The actual source of data is the Database, data warehouse, World Wide Web (WWW),
text files, and other documents.
➢ We need a huge amount of historical data for data mining to be successful.

Data mining engine

➢ It comprises instruments and software used to obtain insights and knowledge from data
collected from various data sources and stored within the data warehouse.
➢ It contains several modules for operating data mining tasks, including association,
characterization, classification, clustering, prediction, time-series analysis, etc.

Data warehouse server

➢ The database or data warehouse server consists of the original data that is ready to be
processed.
➢ The server is responsible for retrieving the relevant data based on the user's data mining request.

Pattern evaluation module

➢ It is primarily responsible for evaluating the interestingness of discovered patterns using a threshold
value. It collaborates with the data mining engine to focus the search on interesting patterns.

Graphical user interface

➢ The graphical user interface (GUI) module communicates between the data mining system
and the user.
➢ This module helps the user to easily and efficiently use the system without knowing the
complexity of the process.
➢ This module cooperates with the data mining system when the user specifies a query or a
task and displays the results.

Knowledge base

➢ It holds domain knowledge that helps to guide the search or evaluate the interestingness of the resulting patterns.
➢ The knowledge base may even contain user views and data from user experiences that
might be helpful in the data mining process.
➢ The pattern evaluation module regularly interacts with the knowledge base to get
inputs and also to update it.
3)EXPLAIN DATA PREPROCESSING IN DETAIL
Data preprocessing refers to the collection, manipulation, and management of data to extract
meaningful information and insights. It involves various steps to transform raw data into a structured
format that can be analysed and utilized effectively.

Steps or Major Tasks in Data Preprocessing

Data cleaning

Data in the real world is dirty: it contains lots of potentially incorrect values, e.g., due to faulty
instruments, human or computer error, or transmission errors.

i)incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data

e.g., Occupation= “” (missing data)

ii)noisy: containing noise, errors, or outliers

e.g., Salary= “−10” (an error)
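
A minimal sketch, assuming pandas, of how the two kinds of dirty data shown above (a missing Occupation and an erroneous Salary) might be cleaned; the table and the chosen fixes are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative dirty data: "" is a missing Occupation, -10 is an erroneous Salary
df = pd.DataFrame({
    "Occupation": ["engineer", "", "teacher"],
    "Salary": [52000.0, -10.0, 61000.0],
})

# Incomplete data: treat empty strings as missing and fill with a default label
df["Occupation"] = df["Occupation"].replace("", np.nan).fillna("unknown")

# Noisy data: treat impossible salaries as errors and replace them
# with the mean of the valid values
valid_mean = df.loc[df["Salary"] > 0, "Salary"].mean()
df.loc[df["Salary"] <= 0, "Salary"] = valid_mean

print(df)
```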

Data integration

➢ Combines data from multiple sources into a coherent store (Data Warehouse)
➢ Schema integration: e.g., A.cust-id ≡ B.cust-#

Detecting and resolving data value conflicts

➢ For the same real-world entity, attribute values from different sources are different
➢ Possible reasons: different representations, different scales, e.g., metric vs. British units

Data reduction

➢ Obtains a reduced representation of the data set that is much smaller in volume but yet
produces the same (or almost the same) analytical results
➢ A database/data warehouse may store terabytes of data. Complex data analysis may
take a very long time to run on the complete data set.

Data transformation and data discretization

➢ A function that maps the entire set of values of a given attribute to a new set of replacement
values s.t. each old value can be identified with one of the new values
➢ Methods
➢ Smoothing: Remove noise from data
➢ Normalization: Scaled to fall within a smaller, specified range
➢ min-max normalization
➢ z-score normalization
➢ normalization by decimal scaling

Discretization: Divide the range of a continuous attribute into intervals


4)EXPLAIN SMOOTHING BY BINS WITH AN EXAMPLE
➢ Smoothing by bin means: In smoothing by bin means, each value in a bin is replaced by
the mean value of the bin.
➢ Smoothing by bin median: In this method each bin value is replaced by its bin median
value.
➢ Smoothing by bin boundary: In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the bin boundaries. Each bin value is
then replaced by the closest boundary value.

METHOD

➢ Sort the array of the given data set.
➢ Divide the range into N intervals, each containing approximately the same number of
samples (equal-depth partitioning).
➢ Store the mean / median / boundaries in each bin.

PROBLEM

Data for price (in dollars): 9, 8, 4, 15, 24, 21, 21, 25, 26, 34, 29, 28

SOLN

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition into three equal-frequency (equal-depth) bins of 4 values each:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34

Smoothing by bin means (each value replaced by its bin mean):
Bin 1: 9, 9, 9, 9
Bin 2: 22.75, 22.75, 22.75, 22.75
Bin 3: 29.25, 29.25, 29.25, 29.25

Smoothing by bin boundaries (each value replaced by the closest bin boundary):
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
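
A minimal sketch, assuming NumPy, of equal-depth partitioning with smoothing by bin means and by bin boundaries, applied to the price data above.

```python
import numpy as np

# Price data from the problem above
prices = sorted([9, 8, 4, 15, 24, 21, 21, 25, 26, 34, 29, 28])

# Equal-depth partitioning into 3 bins of equal size
bins = [list(b) for b in np.array_split(prices, 3)]

# Smoothing by bin means: each value is replaced by the mean of its bin
by_means = [[round(float(np.mean(b)), 2)] * len(b) for b in bins]

# Smoothing by bin boundaries: each value is replaced by the closer of
# the bin's minimum and maximum
def smooth_boundaries(b):
    lo, hi = min(b), max(b)
    return [lo if abs(v - lo) <= abs(v - hi) else hi for v in b]

by_boundaries = [smooth_boundaries(b) for b in bins]

print(bins)
print(by_means)
print(by_boundaries)
```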
UNIT 2
2M
1)DEFINE DATA WAREHOUSE
➢ A data warehouse is a collection of data marts representing historical data from different
operations in the company.

➢ It collects data from multiple heterogeneous sources (database files, flat files, text files, etc.).

➢ It stores 5 to 10 years of data in huge volumes. This data is stored in a structure
optimized for querying and data analysis.

2)GIVE THE KEY PROPERTIES OF A DATA WAREHOUSE


➢ Subject Oriented: Data that gives information about a particular subject instead of about a
company’s ongoing operations.
➢ Integrated: Data that is gathered into the data warehouse from a variety of sources and
merged into a coherent whole.
➢ Time-variant: All data in the data warehouse is identified with a particular time period.
➢ Non-volatile: Data is stable in a data warehouse. More data is added but data is never
removed.

3)GIVE THE DATA WAREHOUSE CHARACTERISTICS


➢ It is a database designed for analytical tasks

➢ Its content is periodically updated

➢ It contains current and historical data to provide historical perspective of information.

4)ADVANTAGES OF MULTI-DIMENSIONAL DATA MODEL


➢ A multi-dimensional data model is easy to handle.
➢ It is easy to maintain.
➢ Its performance is better than that of normal databases
➢ The representation of data is better than traditional databases. That is because the
multi-dimensional databases are multi-viewed and carry different types of factors.

5)DISADVANTAGES OF MULTI-DIMENSIONAL DATA MODEL


➢ The multi-dimensional data model is slightly complicated in nature and requires
professionals to recognize and examine the data in the database.
➢ While working with a multi-dimensional data model, system caching has a great effect on
the performance of the system.
➢ It is complicated in nature, due to which the databases are generally dynamic in design.
6) OLAP OPERATIONS
Drill down

In the drill-down operation, less detailed data is converted into more detailed data. It can be done
by:

➢ Moving down in the concept hierarchy


➢ Adding a new dimension

Roll up

It is just the opposite of the drill-down operation. It performs aggregation on the OLAP cube. It can be
done by:

➢ Climbing up in the concept hierarchy


➢ Reducing the dimensions

Dice

➢ It selects a sub-cube from the OLAP cube by selecting two or more dimensions.

Slice

It selects a single dimension from the OLAP cube which results in a new sub-cube creation.

Pivot

It is also known as rotation operation as it rotates the current view to get a new view of the
representation.
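
A minimal sketch, assuming pandas, of how these operations look on a small hypothetical sales table (dimensions: year, quarter, region; measure: sales).

```python
import pandas as pd

# Hypothetical sales data with a few dimensions and one measure
sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2", "Q2"],
    "region":  ["East", "East", "West", "West", "East", "West"],
    "sales":   [100, 120, 80, 90, 130, 110],
})

# Roll up: aggregate to a coarser level (quarter -> year)
roll_up = sales.groupby("year")["sales"].sum()

# Drill down: move to more detailed data (year -> year and quarter)
drill_down = sales.groupby(["year", "quarter"])["sales"].sum()

# Slice: fix a value of a single dimension
slice_2023 = sales[sales["year"] == 2023]

# Dice: select on two or more dimensions
dice = sales[(sales["year"] == 2023) & (sales["region"] == "East")]

# Pivot: rotate the view (regions as rows, quarters as columns)
pivot = sales.pivot_table(index="region", columns="quarter",
                          values="sales", aggfunc="sum")

print(roll_up, drill_down, slice_2023, dice, pivot, sep="\n\n")
```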

7)DIFFERENTIATE BETWEEN OLAP AND OLTP


➢ Data: OLAP (Online Analytical Processing) consists of historical data from various databases; OLTP (Online Transaction Processing) consists only of current operational data.
➢ Orientation: OLAP is subject oriented and is used for data mining, analytics, decision making, etc.; OLTP is application oriented and is used for business tasks.
➢ Usage: OLAP data is used in planning, problem solving and decision making; OLTP data is used to perform day-to-day fundamental operations.
➢ Size: OLAP stores large amounts of data, typically in TB or PB; OLTP data is relatively small (MB, GB) as historical data is archived.
➢ Speed: OLAP is relatively slow as the amount of data involved is large and queries may take hours; OLTP is very fast as the queries operate on only about 5% of the data.
➢ Backup: OLAP needs backup only from time to time as compared to OLTP; in OLTP the backup and recovery process is maintained rigorously.
➢ Users: OLAP data is generally managed by the CEO, MD, GM; OLTP data is managed by clerks and managers.
➢ Operations: OLAP involves only read and rarely write operations; OLTP involves both read and write operations.
12M
1)EXPLAIN THE ARCHITECTURE OR COMPONENTS OF DATA WAREHOUSING

➢ The data warehouse architecture is based on a database management system (DBMS) server.

➢ The central information repository is surrounded by a number of key components.

➢ A data warehouse is an environment, not a product; it is based on a relational database
management system.

➢ The data entered into the data warehouse is transformed into an integrated structure and
format; the transformation process involves conversion, summarization and filtering.

➢ The data warehouse must be capable of holding and managing large volumes of data as well
as different data structures over time.

Key components

➢ Data sourcing, cleanup, transformation, and migration tools

➢ Metadata repository

➢ Warehouse/database technology

➢ Data marts, Information delivery system

➢ Data query, reporting, analysis, and mining tools

➢ Data warehouse administration and management


Data sourcing, cleanup, transformation, and migration tools

➢ They perform conversions, summarization, key changes, structural changes

➢ Data transformation is required so that the data can be used by decision-support tools.

➢ The transformation produces programs, control statements.

➢ It moves the data into data warehouse from multiple operational systems.

The Functionalities of these tools are listed below:

➢ To remove unwanted data from operational db

➢ Converting to common data names and attributes

➢ Calculating summaries and derived data

➢ Establishing defaults for missing data

➢ Accommodating source data definition changes

Metadata repository

➢ It is data about data. It is used for maintaining, managing and using the data warehouse.

Data warehouse database

➢ This is the central part of the data warehousing environment. It is implemented based on
RDBMS technology.

Data marts

It is an inexpensive alternative to the data warehouse and is based on a subject area. A data mart
is used in the following situations:

➢ Extremely urgent user requirement

➢ The absence of a budget for a full-scale data warehouse strategy

➢ The decentralization of business needs

Query and reporting tools

Used to generate query and report

➢ Production reporting tool used to generate regular operational reports

➢ Desktop report writers are inexpensive desktop tools designed for end users.

Application development tools

This is a graphical data access environment which integrates OLAP tools with data warehouse and
can be used to access all db systems.

➢ OLAP Tools: Are used to analyze the data in multidimensional and complex views.

➢ Data mining tools: Are used to discover knowledge from the data warehouse data.
2)EXPLAIN HOW TO BUILD A DATA WAREHOUSE
Business factors:
➢ Business users want to make decisions quickly and correctly using all available data.
➢ Top-Down Approach: enterprise-wide business requirements are collected and an
enterprise data warehouse with subset data marts is built.
➢ Bottom-Up Approach: data marts are built first and then integrated or combined together
to form a data warehouse.
➢ Data marts are developed and integrated as and when the requirements are clear.
➢ The advantage of the Bottom-Up approach is that it does not require high
initial costs and has a faster implementation time.
Technological factors:
➢ To address the incompatibility of operational data stores
➢ IT infrastructure is changing rapidly. Its capacity is increasing and cost is decreasing
so that building a data warehouse is easy
Design considerations:
➢ In general, a data warehouse integrates data from multiple heterogeneous sources into a query
database; this is also one of the reasons why a data warehouse is difficult to build.
Data content
➢ The content and structure of the data warehouse are reflected in its data model.
Meta data
➢ It defines the location and contents of data in the warehouse.
➢ Meta data is searchable by users to find definitions or subject areas.
Data distribution
➢ Data volumes continue to grow in nature. Therefore, it becomes necessary to know
how the data should be divided across multiple servers.
➢ The data can be distributed based on the subject area, location (geographical region),
or time (current, month, year)
Hardware platforms
➢ An important consideration when choosing a data warehouse server capacity for
handling the high volumes of data.
➢ It has large data and through put.
➢ The modern server can also support large volumes and large number of flexible GUI
3)EXPLAIN THE THREE-TIER DATA WAREHOUSE ARCHITECTURE

Data Warehouses usually have a three-level (tier) architecture that includes:

Bottom Tier (Data Warehouse Server)

A bottom-tier that consists of the Data Warehouse server, which is almost always an RDBMS. It may
include several specialized data marts and a metadata repository.

Data from operational databases and external sources are extracted using application program
interfaces called gateways. A gateway is provided by the underlying DBMS and allows client
programs to generate SQL code to be executed at a server.

Examples of gateways include ODBC (Open Database Connectivity) and JDBC (Java Database
Connectivity).
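
A minimal sketch, assuming the pyodbc package, of a client program using an ODBC gateway to run SQL at the source server; the DSN, credentials and table are hypothetical.

```python
import pyodbc

# Connect through an ODBC gateway (hypothetical data source name and credentials)
conn = pyodbc.connect("DSN=OperationalDB;UID=etl_user;PWD=secret")
cursor = conn.cursor()

# The SQL generated by the client is executed at the source server;
# the extracted rows can then be cleaned and loaded into the warehouse.
cursor.execute("SELECT cust_id, SUM(amount) FROM sales GROUP BY cust_id")
rows = cursor.fetchall()

conn.close()
```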

Top Tier (Front end Tools).

A top-tier that contains front-end tools for displaying results provided by OLAP, as well as additional
tools for data mining of the OLAP-generated data.
Middle Tier (OLAP Server)

A middle-tier which consists of an OLAP server for fast querying of the data warehouse.

(1) A Relational OLAP model, i.e., an extended relational DBMS that maps functions on
multidimensional data to standard relational operations.

(2) A Multidimensional OLAP model, i.e., a special-purpose server that directly implements
multidimensional data and operations.

The metadata repository stores information that defines DW objects. It includes the following
parameters and information for the middle and the top-tier applications:

1. A description of the DW structure, including the warehouse schema, dimensions, hierarchies,
data mart locations, and contents, etc.

2. Operational metadata, which usually describes the currency level of the stored data, i.e.,
active, archived or purged, and warehouse monitoring information, i.e., usage statistics,
error reports, audit, etc.

Load Performance

Data warehouses require incremental loading of new data on a periodic basis within narrow time
windows; performance of the load process should be measured in hundreds of millions of rows and
gigabytes per hour, and must not artificially constrain the volume of data required by the business.

Load Processing

Many steps must be taken to load new or updated data into the data warehouse, including data
conversion, filtering, reformatting, indexing, and metadata updates.

Data Quality Management

Fact-based management demands the highest data quality. The warehouse ensures local consistency,
global consistency, and referential integrity despite "dirty" sources and massive database size.

Query Performance

Fact-based management must not be slowed by the performance of the data warehouse RDBMS;
large, complex queries must be complete in seconds, not days.
4)EXPLAIN THE SCHEMAS FOR MULTI-DIMENSIONAL DATA MODEL
Schema is a logical description of the entire database. It includes the name and description of
records of all record types including all associated data-items and aggregates.

Much like a database, a data warehouse also requires a schema to be maintained. A database uses the
relational model, while a data warehouse uses the Star, Snowflake, or Fact Constellation schema.

Star Schema

➢ Each dimension in a star schema is represented with only one-dimension table and this
dimension table contains the set of attributes.

➢ The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.

➢ There is a fact table at the center. It contains the keys to each of four dimensions.

➢ The fact table also contains the attributes, namely dollars sold and units sold.
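
A minimal sketch, assuming pandas, of the star schema described above: a central sales fact table whose keys join to the time, item, branch and location dimension tables; all table and column names are illustrative.

```python
import pandas as pd

# Dimension tables (illustrative attributes)
time_dim = pd.DataFrame({"time_key": [1, 2], "year": [2023, 2024], "quarter": ["Q1", "Q1"]})
item_dim = pd.DataFrame({"item_key": [10, 11], "item_name": ["pen", "book"], "brand": ["A", "B"]})
branch_dim = pd.DataFrame({"branch_key": [100], "branch_name": ["B1"]})
location_dim = pd.DataFrame({"location_key": [200, 201], "city": ["Chennai", "Mumbai"]})

# Fact table at the centre: keys to each dimension plus the measures
sales_fact = pd.DataFrame({
    "time_key":     [1, 1, 2],
    "item_key":     [10, 11, 10],
    "branch_key":   [100, 100, 100],
    "location_key": [200, 201, 200],
    "dollars_sold": [500.0, 750.0, 620.0],
    "units_sold":   [50, 30, 60],
})

# A typical star-schema query: total dollars sold per city and year,
# obtained by joining the fact table to its dimension tables
report = (sales_fact
          .merge(time_dim, on="time_key")
          .merge(location_dim, on="location_key")
          .groupby(["city", "year"])["dollars_sold"]
          .sum())
print(report)
```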

Snowflake Schema

➢ Some dimension tables in the Snowflake schema are normalized and the normalization splits
up the data into additional tables.

➢ The dimension tables in a snowflake schema are normalized. For example, the item
dimension table in the star schema is normalized and split into two dimension tables, namely
the item and supplier tables.

➢ Now the item dimension table contains the attributes item_key, item_name, type, brand,
and supplier-key.
Fact Constellation Schema

➢ A fact constellation has multiple fact tables. It is also known as galaxy schema.

➢ The shipping fact table has the five dimensions, namely item_key, time_key, shipper_key,
from_location, to_location.

➢ The shipping fact table also contains two measures, namely dollars sold and units sold.

It is also possible to share dimension tables between fact tables. For example, time, item, and
location dimension tables are shared between the sales and shipping fact table.
