Loading Dimensions from a Data Vault Model

Loading data into a Data Vault Model is easy. But how complex is it to extract the data from a Data Vault into a dimensional Data Mart? A Point in Time (PIT) table helps to load Dimensions from a Hub with multiple Satellites.

One often heard concern about Data Vault Modeling is the complexity of ETL processes. While the rules to load Hubs, Links and Satellites are simple and straight-forward, it seems to be more complex to load Dimensions and Facts out of a Data Vault. In my very first blog post about Data Vault (see Data Vault Modeling – My first attempt to walk) I wrote: But even if it would be easier to load data into a Data Vault, it is more complex and expensive to load the Data Marts from a Data Vault because the queries to determine the correct version of each Satellite are not trivial. 

In the meantime, I realized that loading dimensions is indeed not trivial, but much simpler than I expected. The magic behind it is called Point in Time (PIT) Table.

Different Types of Dimensions

The complexity of an ETL process to load a dimension table depends on the type of Slowly Changing Dimension and on the number of Data Vault tables that are used to derive the information of the dimension. Let’s start with the simple cases:

  • Slowly Changing Dimension Type 1: Loading an SCD1 dimension is not a big deal. We have to select the current version of each Satellite and join them with the corresponding Hub. This works for any number of Satellites, and even if multiple Hubs and Links are used to select all dimensional attributes, the extraction of the Data Vault is simple. The only issue is to find the current version of each Satellite. The easiest way is to add an additional attribute to each Satellite table, either an expiry date or an active flag. This increases the ETL complexity to load the Satellites (the expiry date or active flag has to be updated if a new version is loaded), but it is much easier to retrieve the current versions.
  • Slowly Changing Dimension Type 2 with only one Satellite: If a dimension can be loaded from one Hub and one Satellite, we can just join these two Data Vault tables to retrieve all versions for the SCD2 dimension. This even works for multiple Hubs, but only if each of them consists of exactly one Satellite. In a typical Data Vault model, this is usually not the case.

But what happens if we want to load an SCD2 dimension from a Hub with multiple Satellites, or from more than one Hub, each of them containing several Satellites? Here we will have more complex cases:

  • Slowly Changing Dimension Type 2 with more than one Satellite: Because the history tracking is independent for each Satellite, we have to combine the validities of all Satellites to calculate all validity ranges for the dimension table. This can be done with a Point in Time Table, as described below.
  • Slowly Changing Dimension Type 2 with multiple Hubs and Links: When a dimension consists of information from different business entities (e.g. one for each hierarchy level), multiple Hubs have to be joined through the connecting Links. Each of the Hubs (and even the Links) can contain one or more Satellites. In such situations, a so-called Bridge Table is created. In this blog post, I will not go into details about this case – probably in a future post.

Let’s have a detailed look at the case with one Hub with multiple Satellites. In the example below, we see the timeline for a Hub with four Satellites. Each Satellite contains 2 to 3 versions for a particular business key. The combination to be loaded into the SCD2 dimension table will contain 7 different versions.

Versions

Point in Time Table

To calculate the validity range for each version, a Point in Time Table (PIT table) is created in the Data Vault model. It contains all combinations of load dates and is used to derive all valid versions that have to be loaded into an SCD2 dimension table.

A PIT table is connected to a Hub or a Link and is therefore a similar structure as a Satellite. But unlike a normal Satellite, it does not contain descriptive attributes, but the load dates/timestamps of all other Satellites of the same Hub or Link. Let’s explain this with an example: For a Hub H_CUSTOMER, there are four Satellites defined in the Data Vault model. Each of them contains its own history. The address of a customer can change several times, but usually not at the same data as an e-mail address or a phone number. Each of the Satellites may contain multiple versions of data for a particular customer, with different load dates.

PIT Table

The PIT table contains all distinct load dates of all Satellites. For each load date, the newest load date of every Satellite is stored in an additional attribute. This helps to identify what version is valid on a particular date. In the following example, the valid versions for customer 2234 on 22 March 2015 are

  • 16.02.2015 for Satellite S_CUSTOMER_MAIN (S1)
  • 22.03.2015 for Satellite S_CUSTOMER_ADDRESS (S2)
  • 26.02.2015 for Satellite S_CUSTOMER_ONLINE (S3)
  • 12.03.2015 for Satellite S_CUSTOMER_PHONE (S4)

PIT Data

There are two approaches to load the PIT table. One is to select the newest load date of each Satellite after each load. If at least one of them changed since the last load, a new version is inserted into the PIT table. This approach is similar to the incremental load of Satellites.

The other approach is to reload the PIT table from scratch after each load. This sounds more complex than it is. The following SQL statements show how this can easily be implemented:

TRUNCATE TABLE S_CUSTOMER_PIT;
INSERT INTO S_CUSTOMER_PIT
(H_CUSTOMER_SID,
LOAD_DATE,
S1_LOAD_DATE,
S2_LOAD_DATE,
S3_LOAD_DATE,
S4_LOAD_DATE)
WITH LOAD_DATES AS (
SELECT H_CUSTOMER_SID, LOAD_DATE FROM S_CUSTOMER_MAIN
UNION
SELECT H_CUSTOMER_SID, LOAD_DATE FROM S_CUSTOMER_ADDRESS
UNION
SELECT H_CUSTOMER_SID, LOAD_DATE FROM S_CUSTOMER_ONLINE
UNION
SELECT H_CUSTOMER_SID, LOAD_DATE FROM S_CUSTOMER_PHONE
)
SELECT
LD.H_CUSTOMER_SID,
LD.LOAD_DATE,
MAX(S1.LOAD_DATE) OVER
(PARTITION BY LD.H_CUSTOMER_SID ORDER BY LD.LOAD_DATE) S1_LOAD_DATE,
MAX(S2.LOAD_DATE) OVER
(PARTITION BY LD.H_CUSTOMER_SID ORDER BY LD.LOAD_DATE) S2_LOAD_DATE,
MAX(S3.LOAD_DATE) OVER
(PARTITION BY LD.H_CUSTOMER_SID ORDER BY LD.LOAD_DATE) S3_LOAD_DATE,
MAX(S4.LOAD_DATE) OVER
(PARTITION BY LD.H_CUSTOMER_SID ORDER BY LD.LOAD_DATE) S4_LOAD_DATE
FROM LOAD_DATES LD
LEFT JOIN S_CUSTOMER_MAIN S1
ON (S1.H_CUSTOMER_SID = LD.H_CUSTOMER_SID AND S1.LOAD_DATE = LD.LOAD_DATE)
LEFT JOIN S_CUSTOMER_ADDRESS S2
ON (S2.H_CUSTOMER_SID = LD.H_CUSTOMER_SID AND S2.LOAD_DATE = LD.LOAD_DATE)
LEFT JOIN S_CUSTOMER_ONLINE S3
ON (S3.H_CUSTOMER_SID = LD.H_CUSTOMER_SID AND S3.LOAD_DATE = LD.LOAD_DATE)
LEFT JOIN S_CUSTOMER_PHONE S4
ON (S4.H_CUSTOMER_SID = LD.H_CUSTOMER_SID AND S4.LOAD_DATE = LD.LOAD_DATE);
COMMIT;

The subquery factoring clause LOAD_DATES selects all distinct load dates for each key (H_CUSTOMER_SID). With this information, a left outer join on each Satellite is performed. For each Satellite, the newest load date per key is calculated with an analytical function. This query was implemented on an Oracle 12c database, but can be used in any database system that supports ANSI SQL.

History View Layer

For each business entity (i.e. a Hub with all Satellites), we can generate two views:

  • Current View: This view selects all relevant attributes from the current version of each satellites. It is used to retrieve the current state of data and as source to load SCD1 dimensions.
  • History View: In contrast to the Current View, this view selects all versions of data for each business key, including and start and end date for the validity of each version. This view is used for initial loads of SCD2 dimensions.

The Current View can be created without the PIT table. Instead, it is recommended to add an ACTIVE flag to each Satellite. This flag identifies the current version for each key. This reduces the complexity of the view, but instead the ACTIVE flag must be updated when new data is loaded into the Data Vault. With this additional information, the Current View has just to join the Hub with all its Satellites. Outer joins are used for the Satellites to avoid that a customer is not selected because one of the Satellites is empty.

CREATE OR REPLACE VIEW V_CUSTOMER_CURR AS
SELECT
HUB.H_CUSTOMER_SID,
HUB.CUSTOMER_NUMBER,
S1.FIRST_NAME,
S1.LAST_NAME,
S1.DATE_OF_BIRTH,
S1.GENDER,
S2.STREET,
S2.ZIP_CODE,
S2.CITY,
S2.STATE,
S2.COUNTRY,
S3.EMAIL_HOME,
S3.EMAIL_WORK,
S4.PHONE_HOME,
S4.PHONE_WORK,
S4.PHONE_MOBILE,
S4.FAX_HOME,
S4.FAX_WORK
FROM H_CUSTOMER HUB
LEFT JOIN S_CUSTOMER_MAIN S1
ON (S1.H_CUSTOMER_SID = HUB.H_CUSTOMER_SID AND S1.ACTIVE = 'Y')
LEFT JOIN S_CUSTOMER_ADDRESS S2
ON (S2.H_CUSTOMER_SID = HUB.H_CUSTOMER_SID AND S2.ACTIVE = 'Y')
LEFT JOIN S_CUSTOMER_ONLINE S3
ON (S3.H_CUSTOMER_SID = HUB.H_CUSTOMER_SID AND S3.ACTIVE = 'Y')
LEFT JOIN S_CUSTOMER_PHONE S4
ON (S4.H_CUSTOMER_SID = HUB.H_CUSTOMER_SID AND S4.ACTIVE = 'Y');

For the History View, we use the PIT table loaded before. It is joined with the Hub and all connected Satellites. An outer join is used for the Satellites because it may happen that not all Satellites contain data for any timestamp in the past. The VALID_TO date is the load data of the next version – it can be calculated with the analytical function LEAD.

CREATE OR REPLACE VIEW V_CUSTOMER_HIST AS
SELECT
HUB.H_CUSTOMER_SID,
HUB.CUSTOMER_NUMBER,
PIT.LOAD_DATE VALID_FROM,
NVL(LEAD(PIT.LOAD_DATE) OVER
(PARTITION BY PIT.H_CUSTOMER_SID ORDER BY PIT.LOAD_DATE)
,TO_DATE('31.12.9999', 'DD.MM.YYYY')) VALID_TO
,
S1.FIRST_NAME,
S1.LAST_NAME,
S1.DATE_OF_BIRTH,
S1.GENDER,
S2.STREET,
S2.ZIP_CODE,
S2.CITY,
S2.STATE,
S2.COUNTRY,
S3.EMAIL_HOME,
S3.EMAIL_WORK,
S4.PHONE_HOME,
S4.PHONE_WORK,
S4.PHONE_MOBILE,
S4.FAX_HOME,
S4.FAX_WORK
FROM S_CUSTOMER_PIT PIT
JOIN H_CUSTOMER HUB ON (HUB.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID)
LEFT JOIN S_CUSTOMER_MAIN S1
ON (S1.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID AND S1.LOAD_DATE = PIT.S1_LOAD_DATE)
LEFT JOIN S_CUSTOMER_ADDRESS S2
ON (S2.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID AND S2.LOAD_DATE = PIT.S2_LOAD_DATE)
LEFT JOIN S_CUSTOMER_ONLINE S3
ON (S3.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID AND S3.LOAD_DATE = PIT.S3_LOAD_DATE)
LEFT JOIN S_CUSTOMER_PHONE S4
ON (S4.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID AND S4.LOAD_DATE = PIT.S4_LOAD_DATE);

Loading Dimensions

Finally, we come back to our original question: How complex is it to load Dimensions from a Hub with multiple Satellites? With the PIT table and the History View Layer that we built before, most of the work is already done. To load the dimension tables, we can distinguish the following situations:

  • Initial Load of SCD1 Dimension: This is the simplest case. We truncate the dimension table and load all rows from the Current View into the empty table.
  • Initial Load of SCD2 Dimension: Here again, we truncate the dimension table. But for an SCD2 dimension, we load all rows from the History View into the empty table.
  • Incremental Load of SCD1 Dimension: The Current View is used to do a MERGE into the dimension table. All new rows are inserted, and changed rows are updated.
  • Incremental Load of SCD2 Dimension: The Current View is used to do a MERGE into the dimension table. All new and changed rows are inserted, and the previous versions are updated.

As we can see from these four different load scenarios, the History View is only required for an initial load of an SCD2 dimension. In all other situations, the Current View is sufficient to load the dimensions. But in a agile DWH environment, it should always be possible to rebuild Data Marts from scratch. So, it’s a good idea to have a History View for all Hubs and Satellites. With a PIT table, this can be built easily.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s