Govindarajan Data Vault PDF
Govindarajan Data Vault PDF
http://www.globytes.com
[email protected]
1
Data Vault Modeling
Agenda
Introduction
Data Vault Place in Evolution
Data Warehousing Architecture
Data Vault Components
Case Study
Typical 3NF / Star Schema
Data Vault Approach
Temporal Data + Auditability
Implementation of Data Vault
Conclusion 2
Introduction
3
Data Vault Place in Evolution
4
Data Warehousing Architecture
Data Mart
OLTP Staging Enterprise Staging Data Mart
(optional) Data
DataMarts
Marts
Systems & ODS DW
ive
a pt arts
Ad M
OLTP
Systems Staging
Data Mart Data Mart
5
Data Vault Components
Hub Entities
Candidate Keys + Load Time + Source
Link Entities
FKs from Hub + Load Time + Source
Satellite Entities
Descriptive Data + Load Time + Source + End Time
Dimension = Hub+ Satellite
6
Case Study
Modeling Approaches
Typical 3NF / Star Schema
Data Vault approach
Scenario
Typical Fact and Dimension
New Dimension data
Reference Data change - 1:M to M:M
Additional attributes to Dimension
7
Typical 3 F / Star Schema
1. Typical Fact and Dimension
8
Typical 3 F / Star Schema
2. ew Dimension Data
R E F P R OD U CT CA T E GOR Y
TR ORDER DIM PRODUCT p ro d uc t_c a te g o ry _id
tr_order_id dim_product_id
p ro d uc t_c a te g o ry _na me
dim_product_id product_id p ro d uc t_c a te g o ry _c o d e
system order id product_code p ro d uc t_c a te g o ry _d e s c
system root order id product desc is _s tra te g y
global root order id product_name
order quantity p ro d uc t_c a te g o ry _id
total executed qty p ro d uc t_c a te g o ry _na me
event timestamp p ro d uc t_c a te g o ry _c o d e
p ro d uc t_c a te g o ry _d e s c
is _s tra te g y
9
Typical 3 F / Star Schema
2. ew Dimension Data
R E F P R OD U CT CA T E GOR Y
TR ORDER DIM PRODUCT p ro d uc t_c a te g o ry _id
tr_order_id dim_product_id
p ro d uc t_c a te g o ry _na me
dim_product_id product_id p ro d uc t_c a te g o ry _c o d e
system order id product_code p ro d uc t_c a te g o ry _d e s c
system root order id product desc is _s tra te g y
global root order id product_name
order quantity le ng th
total executed qty D IM P R OD U CT CA T E GOR Y
event timestamp
d im_p ro d uc t_c a te g o ry _id d im_p ro d uc t_c a te g o ry _id
product_category_code
product_category_name
product_category_desc
is_strategy
namespace
10
Typical 3 F / Star Schema
3. Reference 1:M to M:M Change
R E F P R OD U CT CA T E GOR Y
TR ORDER DIM PRODUCT p ro d uc t_c a te g o ry _id
tr_order_id dim_product_id
p ro d uc t_c a te g o ry _na me
dim_product_id product_id p ro d uc t_c a te g o ry _c o d e
system order id
system root order id
global root order id
product_code
product desc
product_name
? p ro d uc t_c a te g o ry _d e s c
is _s tra te g y
order quantity le ng th
total executed qty D IM P R OD U CT CA T E GOR Y
event timestamp
d im_p ro d uc t_c a te g o ry _id
order owner firm id ? d im_p ro d uc t_c a te g o ry _id
product_category_code
product_category_name
product_category_desc
is_strategy
namespace
11
Typical 3 F / Star Schema
4. Additional attributes to Dimension
12
Data Vault Approach
1. Typical Hub, Satellite and Link
(Replacing Fact/Dimension)
SAT _PR OD U CT 01
HUB_PRODUCT
s a t_p ro d uc t01_id
hub _p ro d uc t_id
hub _p ro d uc t_id
hub _re c o rd _s o urc e s a t_lo a d _d a te
hub _lo a d _d a te s a t_e nd _d a te
product_id s a t_re c o rd _s o urc e
product_code product desc
product_name
symbol
HUB_ORDER
SA T OR D ER 01
LNK_ORDER hub _o rd e r_id
s a t_o rd e r01_id
lnk _o rd e r_id hub _re c o rd _s o urc e
hub _o rd e r_id
hub _lo a d _d a te
lnk _lo a d _d a te s a t_lo a d _d a te
tr_order_id
lnk _re c o rd _s o urc e s a t_e nd _d a te
system order id
hub _p ro d uc t_id s a t_re c o rd _s o urc e
system root order id
hub _o rd e r_id order quantity
global root order id
total executed qty
event timestamp
13
Data Vault Approach
2. ew Dimension Data
SAT_PRODUCT01
HUB_PRODUCT
s a t_p ro d uc t01_id
hub _p ro d uc t_id
hub _p ro d uc t_id
hub _re c o rd _s o urc e s a t_lo a d _d a te
hub _lo a d _d a te s a t_e nd _d a te
product_id s a t_re c o rd _s o urc e
product_code product desc
product_name
symbol
HUB_ORDER
SAT ORDER01
LNK_ORDER hub _o rd e r_id
s a t_o rd e r01_id
lnk _o rd e r_id hub _re c o rd _s o urc e
hub _o rd e r_id
hub _lo a d _d a te
lnk _lo a d _d a te s a t_lo a d _d a te
tr_order_id
lnk _re c o rd _s o urc e s a t_e nd _d a te
system order id
hub _p ro d uc t_id s a t_re c o rd _s o urc e
system root order id
hub _o rd e r_id order quantity
global root order id
total executed qty
event timestamp
S A T _CA T E GOR Y 01
H U B _CA T E GOR Y s a t_c a te g o ry 01_id
LN K_P R OD U CT _X_CA T E GOR Y
hub _c a te g o ry _id hub _c a te g o ry _id
lnk _p ro d uc t_x_c a te g o ry _id
s a t_lo a d _d a te
hub _re c o rd _s o urc e
lnk _lo a d _d a te s a t_e nd _d a te
hub _lo a d _d a te
lnk _re c o rd _s o urc e s a t_re c o rd _s o urc e
product_category_id
hub _p ro d uc t_id product_category_name
product_category_code
hub _c a te g o ry _id product_category_desc
is_strategy
14
Data Vault Approach
3. Reference 1:M to M:M Change
SAT_PRODUCT01
HUB_PRODUCT
s a t_p ro d uc t01_id
hub _p ro d uc t_id
hub _p ro d uc t_id
hub _re c o rd _s o urc e s a t_lo a d _d a te
hub _lo a d _d a te s a t_e nd _d a te
product_id s a t_re c o rd _s o urc e
product_code product desc
product_name
symbol
HUB_ORDER
SAT ORDER01
LNK_ORDER hub _o rd e r_id
s a t_o rd e r01_id
lnk _o rd e r_id hub _re c o rd _s o urc e
hub _o rd e r_id
hub _lo a d _d a te
lnk _lo a d _d a te s a t_lo a d _d a te
tr_order_id
lnk _re c o rd _s o urc e s a t_e nd _d a te
system order id
hub _p ro d uc t_id s a t_re c o rd _s o urc e
system root order id
hub _o rd e r_id order quantity
global root order id
total executed qty
event timestamp
S A T _CA T E GOR Y 01
H U B _CA T E GOR Y s a t_c a te g o ry 01_id
LN K_P R OD U CT _X_CA T E GOR Y
hub _c a te g o ry _id hub _c a te g o ry _id
lnk _p ro d uc t_x_c a te g o ry _id
s a t_lo a d _d a te
hub _re c o rd _s o urc e
lnk _lo a d _d a te s a t_e nd _d a te
hub _lo a d _d a te
lnk _re c o rd _s o urc e s a t_re c o rd _s o urc e
product_category_id
hub _p ro d uc t_id product_category_name
product_category_code
hub _c a te g o ry _id product_category_desc
is_strategy
15
Data Vault Approach
4. Additional attributes to Dimension
SAT_PRODUCT01
s a t_p ro d uc t01_id
hub _p ro d uc t_id
s a t_lo a d _d a te
s a t_e nd _d a te
s a t_re c o rd _s o urc e
product desc
HUB_PRODUCT product_name
hub _p ro d uc t_id
S A T _P R OD U CT 02
hub _re c o rd _s o urc e
hub _lo a d _d a te s a t_p ro d uc t02_id
product_id
hub _p ro d uc t_id
product_code
s a t_lo a d _d a te
s a t_e nd _d a te
s a t_re c o rd _s o urc e
length
height
weight
packaging
color
16
Temporal Data + Auditability
S AT _PR OD U CT 01
s a t_p ro d uc t01_id
hub _p ro d uc t_id
s a t_lo a d _d a te
s a t_e nd _d a te
s a t_re c o rd _s o urc e
product desc
HUB_PRODUCT product_name
hub _p ro d uc t_id
S AT _PR OD U CT 02
hub _re c o rd _s o urc e
hub _lo a d _d a te s a t_p ro d uc t02_id
product_id
hub _p ro d uc t_id
product_code
s a t_lo a d _d a te
s a t_e nd _d a te
s a t_re c o rd _s o urc e
length
height
weight
packaging
color
LNK_PRODUCT_X_CATEGORY
LNK_ORDER HUB_ORDER S AT OR D ER 01
lnk _p ro d uc t_x_c a te g o ry _id
lnk _o rd e r_id hub _o rd e r_id s a t_o rd e r01_id
lnk _lo a d _d a te
lnk _lo a d _d a te hub _re c o rd _s o urc e hub _o rd e r_id
lnk _re c o rd _s o urc e
lnk _re c o rd _s o urc e hub _lo a d _d a te s a t_lo a d _d a te
hub _p ro d uc t_id
hub _o rd e r_id tr_order_id s a t_e nd _d a te
hub _c a te g o ry _id
hub _p ro d uc t_id system order id s a t_re c o rd _s o urc e
hub _c a te g o ry _id system root order id order quantity
global root order id total executed qty
event timestamp
source_system
17
Implementation of Data Vault
Make it Simple
View for Dimension
View For Fact
Data Loading
Take advantage of Technology
DW Appliances
High-throughput storage devices
RDBMS Features
18
Make it Simple
HUB_PRODUCT
SAT_PRODUCT01
SAT_PRODUCT02
19
Make it Simple
HUB_ORDER
SAT_ORDER01
LNK_ORDER
20
Data Loading
Typically data loading for a Data Vault is in the
following sequence.
Hubs for Dimensions
Links for Dimensions
Satellites for Dimensions
Hubs for Fact (if any )
Links for Fact
Satellites for Fact
Refer to Data Vault Series article 5 of Linstedt for
further details
21
DW Appliances
Netezza
Teradata
RedBrick
GreenPlum
22
High-throughput storage devices
23
RDBMS Features
24
Conclusion
In this presentation we have addressed high level overview of Data
Vault Architecture. Topics discussed are
Data Vault concept and architecture
25
Questions ?
? 26
References
27
Special Thanks to
28
The End
Contact Email :
[email protected]
29