Loading Dimensions from a Data Vault Model

Loading data into a Data Vault Model is easy. But how complex is it to extract the data from a Data Vault into a dimensional Data Mart? A Point in Time (PIT) table helps to load Dimensions from a Hub with multiple Satellites.

One often heard concern about Data Vault Modeling is the complexity of ETL processes. While the rules to load Hubs, Links and Satellites are simple and straight-forward, it seems to be more complex to load Dimensions and Facts out of a Data Vault. In my very first blog post about Data Vault (see Data Vault Modeling – My first attempt to walk) I wrote: But even if it would be easier to load data into a Data Vault, it is more complex and expensive to load the Data Marts from a Data Vault because the queries to determine the correct version of each Satellite are not trivial. 

In the meantime, I realized that loading dimensions is indeed not trivial, but much simpler than I expected. The magic behind it is called Point in Time (PIT) Table.

Different Types of Dimensions

The complexity of an ETL process to load a dimension table depends on the type of Slowly Changing Dimension and on the number of Data Vault tables that are used to derive the information of the dimension. Let’s start with the simple cases:

  • Slowly Changing Dimension Type 1: Loading an SCD1 dimension is not a big deal. We have to select the current version of each Satellite and join them with the corresponding Hub. This works for any number of Satellites, and even if multiple Hubs and Links are used to select all dimensional attributes, the extraction of the Data Vault is simple. The only issue is to find the current version of each Satellite. The easiest way is to add an additional attribute to each Satellite table, either an expiry date or an active flag. This increases the ETL complexity to load the Satellites (the expiry date or active flag has to be updated if a new version is loaded), but it is much easier to retrieve the current versions.
  • Slowly Changing Dimension Type 2 with only one Satellite: If a dimension can be loaded from one Hub and one Satellite, we can just join these two Data Vault tables to retrieve all versions for the SCD2 dimension. This even works for multiple Hubs, but only if each of them consists of exactly one Satellite. In a typical Data Vault model, this is usually not the case.

But what happens if we want to load an SCD2 dimension from a Hub with multiple Satellites, or from more than one Hub, each of them containing several Satellites? Here we will have more complex cases:

  • Slowly Changing Dimension Type 2 with more than one Satellite: Because the history tracking is independent for each Satellite, we have to combine the validities of all Satellites to calculate all validity ranges for the dimension table. This can be done with a Point in Time Table, as described below.
  • Slowly Changing Dimension Type 2 with multiple Hubs and Links: When a dimension consists of information from different business entities (e.g. one for each hierarchy level), multiple Hubs have to be joined through the connecting Links. Each of the Hubs (and even the Links) can contain one or more Satellites. In such situations, a so-called Bridge Table is created. In this blog post, I will not go into details about this case – probably in a future post.

Let’s have a detailed look at the case with one Hub with multiple Satellites. In the example below, we see the timeline for a Hub with four Satellites. Each Satellite contains 2 to 3 versions for a particular business key. The combination to be loaded into the SCD2 dimension table will contain 7 different versions.

Versions

Point in Time Table

To calculate the validity range for each version, a Point in Time Table (PIT table) is created in the Data Vault model. It contains all combinations of load dates and is used to derive all valid versions that have to be loaded into an SCD2 dimension table.

A PIT table is connected to a Hub or a Link and is therefore a similar structure as a Satellite. But unlike a normal Satellite, it does not contain descriptive attributes, but the load dates/timestamps of all other Satellites of the same Hub or Link. Let’s explain this with an example: For a Hub H_CUSTOMER, there are four Satellites defined in the Data Vault model. Each of them contains its own history. The address of a customer can change several times, but usually not at the same data as an e-mail address or a phone number. Each of the Satellites may contain multiple versions of data for a particular customer, with different load dates.

PIT Table

The PIT table contains all distinct load dates of all Satellites. For each load date, the newest load date of every Satellite is stored in an additional attribute. This helps to identify what version is valid on a particular date. In the following example, the valid versions for customer 2234 on 22 March 2015 are

  • 16.02.2015 for Satellite S_CUSTOMER_MAIN (S1)
  • 22.03.2015 for Satellite S_CUSTOMER_ADDRESS (S2)
  • 26.02.2015 for Satellite S_CUSTOMER_ONLINE (S3)
  • 12.03.2015 for Satellite S_CUSTOMER_PHONE (S4)

PIT Data

There are two approaches to load the PIT table. One is to select the newest load date of each Satellite after each load. If at least one of them changed since the last load, a new version is inserted into the PIT table. This approach is similar to the incremental load of Satellites.

The other approach is to reload the PIT table from scratch after each load. This sounds more complex than it is. The following SQL statements show how this can easily be implemented:

TRUNCATE TABLE S_CUSTOMER_PIT;
INSERT INTO S_CUSTOMER_PIT
(H_CUSTOMER_SID,
LOAD_DATE,
S1_LOAD_DATE,
S2_LOAD_DATE,
S3_LOAD_DATE,
S4_LOAD_DATE)
WITH LOAD_DATES AS (
SELECT H_CUSTOMER_SID, LOAD_DATE FROM S_CUSTOMER_MAIN
UNION
SELECT H_CUSTOMER_SID, LOAD_DATE FROM S_CUSTOMER_ADDRESS
UNION
SELECT H_CUSTOMER_SID, LOAD_DATE FROM S_CUSTOMER_ONLINE
UNION
SELECT H_CUSTOMER_SID, LOAD_DATE FROM S_CUSTOMER_PHONE
)
SELECT
LD.H_CUSTOMER_SID,
LD.LOAD_DATE,
LEAD(LD.LOAD_DATE) OVER
(PARTITION BY LD.H_CUSTOMER_SID ORDER BY LD.LOAD_DATE) LOAD_END_DATE,
MAX(S1.LOAD_DATE) OVER
(PARTITION BY LD.H_CUSTOMER_SID ORDER BY LD.LOAD_DATE) S1_LOAD_DATE,
MAX(S2.LOAD_DATE) OVER
(PARTITION BY LD.H_CUSTOMER_SID ORDER BY LD.LOAD_DATE) S2_LOAD_DATE,
MAX(S3.LOAD_DATE) OVER
(PARTITION BY LD.H_CUSTOMER_SID ORDER BY LD.LOAD_DATE) S3_LOAD_DATE,
MAX(S4.LOAD_DATE) OVER
(PARTITION BY LD.H_CUSTOMER_SID ORDER BY LD.LOAD_DATE) S4_LOAD_DATE
FROM LOAD_DATES LD
LEFT JOIN S_CUSTOMER_MAIN S1
ON (S1.H_CUSTOMER_SID = LD.H_CUSTOMER_SID AND S1.LOAD_DATE = LD.LOAD_DATE)
LEFT JOIN S_CUSTOMER_ADDRESS S2
ON (S2.H_CUSTOMER_SID = LD.H_CUSTOMER_SID AND S2.LOAD_DATE = LD.LOAD_DATE)
LEFT JOIN S_CUSTOMER_ONLINE S3
ON (S3.H_CUSTOMER_SID = LD.H_CUSTOMER_SID AND S3.LOAD_DATE = LD.LOAD_DATE)
LEFT JOIN S_CUSTOMER_PHONE S4
ON (S4.H_CUSTOMER_SID = LD.H_CUSTOMER_SID AND S4.LOAD_DATE = LD.LOAD_DATE);
COMMIT;

The subquery factoring clause LOAD_DATES selects all distinct load dates for each key (H_CUSTOMER_SID). With this information, a left outer join on each Satellite is performed. For each Satellite, the newest load date per key is calculated with an analytical function. This query was implemented on an Oracle 12c database, but can be used in any database system that supports ANSI SQL.

The attribute LOAD_END_DATE is optional, but is very convenient for getting the current versions, as we will see in the Current View of the next paragraph.

History View Layer

For each business entity (i.e. a Hub with all Satellites), we can generate two views:

  • Current View: This view selects all relevant attributes from the current version of each satellite. It is used to retrieve the current state of data and as source to load SCD1 dimensions.
  • History View: In contrast to the Current View, this view selects all versions of data for each business key, including and start and end date for the validity of each version. This view is used for initial loads of SCD2 dimensions.

The Current View reads the PIT table loaded before. It is joined with the Hub and all connected Satellites. An outer join is used for the Satellites because it may happen that not all Satellites contain data for any timestamp in the past. Outer joins are used for the Satellites to avoid that a customer is not selected because one of the Satellites is empty. Because we are interested in the current versions only, only the PIT rows with no LOAD_END_DATE are selected.

CREATE OR REPLACE VIEW V_CUSTOMER_CURR AS
SELECT
HUB.H_CUSTOMER_SID,
HUB.CUSTOMER_NUMBER,
S1.FIRST_NAME,
S1.LAST_NAME,
S1.DATE_OF_BIRTH,
S1.GENDER,
S2.STREET,
S2.ZIP_CODE,
S2.CITY,
S2.STATE,
S2.COUNTRY,
S3.EMAIL_HOME,
S3.EMAIL_WORK,
S4.PHONE_HOME,
S4.PHONE_WORK,
S4.PHONE_MOBILE,
S4.FAX_HOME,
S4.FAX_WORK
FROM S_CUSTOMER_PIT PIT
JOIN H_CUSTOMER HUB ON (HUB.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID)
LEFT JOIN S_CUSTOMER_MAIN S1
ON (S1.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID AND S1.LOAD_DATE = PIT.S1_LOAD_DATE)
LEFT JOIN S_CUSTOMER_ADDRESS S2
ON (S2.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID AND S2.LOAD_DATE = PIT.S2_LOAD_DATE)
LEFT JOIN S_CUSTOMER_ONLINE S3
ON (S3.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID AND S3.LOAD_DATE = PIT.S3_LOAD_DATE)
LEFT JOIN S_CUSTOMER_PHONE S4
ON (S4.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID AND S4.LOAD_DATE = PIT.S4_LOAD_DATE)
WHERE PIT.LOAD_END_DATE IS NULL;

The History View looks almost the same as the Current View, except that all rows from the PIT table are selected, not only the current ones. Additionally, two columns VALID_FROM and VALID_TO are provided in the view. They can be used to retrieve all versions that are valid on a particular snapshot date. To make it easier to select the data, the end date of the last version (which is NULL) is replaced by the maximal date 31 December 9999.

CREATE OR REPLACE VIEW V_CUSTOMER_HIST AS
SELECT
HUB.H_CUSTOMER_SID,
HUB.CUSTOMER_NUMBER,
PIT.LOAD_DATE VALID_FROM,
NVL(PIT.LOAD_END_DATE, TO_DATE('31.12.9999', 'DD.MM.YYYY')) VALID_TO,
S1.FIRST_NAME,
S1.LAST_NAME,
S1.DATE_OF_BIRTH,
S1.GENDER,
S2.STREET,
S2.ZIP_CODE,
S2.CITY,
S2.STATE,
S2.COUNTRY,
S3.EMAIL_HOME,
S3.EMAIL_WORK,
S4.PHONE_HOME,
S4.PHONE_WORK,
S4.PHONE_MOBILE,
S4.FAX_HOME,
S4.FAX_WORK
FROM S_CUSTOMER_PIT PIT
JOIN H_CUSTOMER HUB ON (HUB.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID)
LEFT JOIN S_CUSTOMER_MAIN S1
ON (S1.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID AND S1.LOAD_DATE = PIT.S1_LOAD_DATE)
LEFT JOIN S_CUSTOMER_ADDRESS S2
ON (S2.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID AND S2.LOAD_DATE = PIT.S2_LOAD_DATE)
LEFT JOIN S_CUSTOMER_ONLINE S3
ON (S3.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID AND S3.LOAD_DATE = PIT.S3_LOAD_DATE)
LEFT JOIN S_CUSTOMER_PHONE S4
ON (S4.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID AND S4.LOAD_DATE = PIT.S4_LOAD_DATE);

Loading Dimensions

Finally, we come back to our original question: How complex is it to load Dimensions from a Hub with multiple Satellites? With the PIT table and the History View Layer that we built before, most of the work is already done. To load the dimension tables, we can distinguish the following situations:

  • Initial Load of SCD1 Dimension: This is the simplest case. We truncate the dimension table and load all rows from the Current View into the empty table.
  • Initial Load of SCD2 Dimension: Here again, we truncate the dimension table. But for an SCD2 dimension, we load all rows from the History View into the empty table.
  • Incremental Load of SCD1 Dimension: The Current View is used to do a MERGE into the dimension table. All new rows are inserted, and changed rows are updated.
  • Incremental Load of SCD2 Dimension: The Current View is used to do a MERGE into the dimension table. All new and changed rows are inserted, and the previous versions are updated.

As we can see from these four different load scenarios, the History View is only required for an initial load of an SCD2 dimension. In all other situations, the Current View is sufficient to load the dimensions. But in a agile DWH environment, it should always be possible to rebuild Data Marts from scratch. So, it’s a good idea to have a History View for all Hubs and Satellites. With a PIT table, this can be built easily.

20 thoughts on “Loading Dimensions from a Data Vault Model

    • Hi David
      No, with the condition S1.LOAD_DATE <= LD.LOAD_DATE you will get duplicates. The query selects all distinct load dates from all satellites in the subquery factoring clause LOAD_DATES. After this step, equi-joins are possible to join the satellites.
      Cheers, Dani.

      Like

      • Hi Dani,

        Why do you use MAX(S1.LOAD_DATE) then ?

        The query should be :

        INSERT INTO S_CUSTOMER_PIT
        (H_CUSTOMER_SID,
        LOAD_DATE,
        S1_LOAD_DATE,
        S2_LOAD_DATE,
        S3_LOAD_DATE,
        S4_LOAD_DATE)
        WITH LOAD_DATES AS (
        SELECT H_CUSTOMER_SID, LOAD_DATE FROM S_CUSTOMER_MAIN
        UNION
        SELECT H_CUSTOMER_SID, LOAD_DATE FROM S_CUSTOMER_ADDRESS
        UNION
        SELECT H_CUSTOMER_SID, LOAD_DATE FROM S_CUSTOMER_ONLINE
        UNION
        SELECT H_CUSTOMER_SID, LOAD_DATE FROM S_CUSTOMER_PHONE
        )
        SELECT
        LD.H_CUSTOMER_SID,
        LD.LOAD_DATE,
        MAX(S1.LOAD_DATE) S1_LOAD_DATE,
        MAX(S2.LOAD_DATE) S2_LOAD_DATE,
        MAX(S3.LOAD_DATE) S3_LOAD_DATE,
        MAX(S4.LOAD_DATE) S4_LOAD_DATE
        FROM LOAD_DATES LD
        LEFT JOIN S_CUSTOMER_MAIN S1
        ON (S1.H_CUSTOMER_SID = LD.H_CUSTOMER_SID AND S1.LOAD_DATE <= LD.LOAD_DATE)
        LEFT JOIN S_CUSTOMER_ADDRESS S2
        ON (S2.H_CUSTOMER_SID = LD.H_CUSTOMER_SID AND S2.LOAD_DATE <= LD.LOAD_DATE)
        LEFT JOIN S_CUSTOMER_ONLINE S3
        ON (S3.H_CUSTOMER_SID = LD.H_CUSTOMER_SID AND S3.LOAD_DATE <= LD.LOAD_DATE)
        LEFT JOIN S_CUSTOMER_PHONE S4
        ON (S4.H_CUSTOMER_SID = LD.H_CUSTOMER_SID AND S4.LOAD_DATE <= LD.LOAD_DATE)
        GROUP BY LD.H_CUSTOMER_SID, LD.LOAD_DATE;

        If you don't use the "<=", the query will not return all the information.
        The GROUP BY take care of the duplicate and the MAX function return the most recent date from the satellites.

        Cheers,
        David

        Liked by 2 people

      • Hi David

        Thanks for your comment. Your proposed query returns the same result, but the WINDOW (SORT) operator in the execution plan (for the analytical functions) is not required. Instead, a GROUP BY is performed. All other parts of the execution plan are identical. So, your query is even a bit faster than mine 🙂

        Cheers, Dani.

        Like

  1. Thanks for this example, still waiting for your future post about “Slowly Changing Dimension Type 2 with multiple Hubs and Links”, i have implemented a solution via bridge table, but not fast and very complex.

    Cheers,
    Rashford

    Liked by 1 person

  2. Hi Dani,
    Thanks for posting this example. This was of great help to our project in populating our own Point in Time tables. I’d found numerous posts that talked around the subject but yours is the first that included actual useful code. In fact it was straightforward for me to then write a code generator in our ETL tool to generate the SQL for any PIT table driven by a few simple parameters.
    Cheers,
    Andy

    Liked by 1 person

  3. Mr. Schnider,

    I’ve found the business vault to be the most challenging part of this journey. This is one of the few posts that I’ve found that simplifies how to virturalize dimensions using PIT tables. I have yet to find a simple demonstration on how to pull together FACT data using BRIDGE tables. I am really looking forward to your post on this topic. Thank you for your contribution to this small DV community!

    John

    Like

  4. Is there a post yet for Slowly Changing Dimension Type 2 with multiple Hubs and Links? I have tried bringing the satellite load dates into the bridge and then creating a PIT from there but that is causing duplicates.

    Like

    • Hi Jason

      A bridge table contains all the keys from a set Hubs and Links, but not the Satellites. Typically, a dimension with multiple hierarchy levels contains one bridge table. This table is joined with the PIT tables for each level.

      Because you are not the first reader asking for an example with a bridge table, I think I have to plan a blog post about this topic soon. Hope to find some time this summer.

      Cheers, Dani.

      Like

      • Jason,

        I eagerly await your next blog post regarding an example on how to design more importantly populate a bridge table. In the meantime, if you can point us to any resources that may be helpful in this regard, it would be greatly appreciated.

        Like

  5. Hi John

    In addition to the two Data Vault books I recommended in https://danischnider.wordpress.com/2016/04/13/data-vault-book-recommendations/ here a list of useful blogs about Data Vault:

    http://danlinstedt.com
    https://hanshultgren.wordpress.com
    https://tedamoh.com/blog
    https://learndatavault.com/blog/
    http://roelantvos.com/blog/

    I’m sure the list is not complete, but these are the links I regularly use.

    Cheers, Dani (not Jason)

    Like

  6. Hi,
    For me it seems that a V_CUSTOMER_HIST.VALID_TO column should be calculated using S_CUSTOMER_PIT.LOAD_END_DATE instead of S_CUSTOMER_PIT.LOAD_DATE as now you have same dates in VALID_TO field.

    What do you think about it?

    Best Regards,
    Jan Skorupiński

    Like

    • Hi Jan

      Of course, you are right! That was a silly mistake I made when I updated the blog post a few months ago.
      I changed the SQL code, it should be correct now. Thanks for your comment.

      Cheers, Dani.

      Like

  7. So does it mean the PIT table has a column called LOAD_END_DATE? If so what is the logic to load that column?

    Also, regarding the type2 query where we are joining PIT with satellite on the basis of H_CUSTOMER_SID and LOAD_DATE. First PIT is joined with Hub on H_CUSTOMER_SID. Lets take your example, this join will produce 8 records with different PIT LOAD_DATE. Now we are making a left join with S_CUSTOMER_MAIN on S1.H_CUSTOMER_SID = PIT.H_CUSTOMER_SID AND S1.LOAD_DATE = PIT.S1_LOAD_DATE. This will also produce 8 records but only first record will have FIRST_NAME, LAST_NAME, DATE_OF_BIRTH, GENDER. rest of the records will have this information null (because no matching record found from S_CUSTOMER_MAIN on LOAD DATE. So how this is represented as Type-2. I believe all 8 instances of should have S_CUSTOMER_MAIN data populated as this satellite data was available from day-1.

    Please add your suggestion.

    Like

    • Hi Ajay

      Thanks for your comment. The LOAD_END_DATE in the PIT table is not mandantory, but makes it easier to find all records that are vaild on a particular day. It contains the same value as the LOAD_DATE of the next record and is calculated with the analytic function LEAD, as described in the blog post.

      About your concerns with the NULL records: If you go through the example step by step, you will see how it works. For your case, the PIT table will contain 8 records with different load dates, but all of them contain the same load date for the specific satellite (e.g. S1_LOAD_DATE). So, the joins in the views afterwards would return the same values for FIRST_NAME, LAST_NAME, etc. for all 8 versions.

      Cheers, Dani.

      Like

  8. Hi Dani,
    Thank you for posting this article and it has helped my understanding. Can you also please an example for building a view for a fact with both SCD1 and SCD2?

    Like

      • Thanks Dani.
        I have one more question regarding the PIT tables, The query in your example will not give the load date on a day where there is no update in any of the satellite. Because of that we have to write a subquery to fetch the max load date for a hash key.
        We do not have end date in our PIT tables, Can we handle this without end date?
        The only option I see is to generate hash keys for all the missing days by joining a date table with Load_dates(CTE in your query).

        Like

  9. Hi Saray

    If your PIT table contains no end date, I know (or have seen) the following options:
    1. The PIT table contains a row per key for each day. Because the PIT table will become very big, it is a common approach to use partitioning and drop old partitions after some months.
    2. Calculate the end day in the query. In this case, I would recommend to use analytic functions instead of subqueries.

    To avoid big PIT tables and slow query performance, I prefer to use an end date in the PIT table (but not in the satellites). This works fine as long as the PIT table can be rebuilt after every load. Otherwise, I would prefer option 1.

    Regards, Dani.

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s