Time Dimension Keys and Partitioning

In the physical design of a star schema, it is recommended to use surrogate keys for the primary keys of dimension tables. Except for the time dimension. In Oracle data warehouses it is common to use a DATE column as the primary key. The reason for this decision and several alternative solutions are described in this post.

I’m used to receive questions about data warehouse design from colleagues, customers or training attendees in Switzerland, Germany or Austria. Therefore I was surprised when a got an e-mail from somebody at Bowling Green State University in Ohio with the following question:

We are using surrogate keys everywhere. I am researching trying to figure out how we could partition on dates in fact tables. Do we need to denormalize to put the date data in the fact table to facilitate partitioning on the date? Seems like the big pro’s are preaching surrogate keys, but partitioning and bitmapped indexes would like real keys.

I was surprised about the origin of the mail, but not about the question. The combination of surrogate keys and partitioning is a common issue and not easy to handle. This is one of the reasons why a natural DATE key is used for the time dimension in most Oracle data warehouses. Let’s see for alternatives and decide what is the most reasonable solution.

Surrogate Key

A common recommendation in data warehouses is to use surrogate keys as primary keys for all dimensions. Explanations and reasons for this approach can be found in several books and articles of Ralph Kimball, for example in his article “Keep control over record identifiers by generating new keys for the data warehouse“. Because the dimension keys have no business meaning and are only used for joining the fact table with the dimension tables, there is no reason to use a natural key. The only exception is, when the fact table is partitioned. The time dimension key is typically used as partition key for the fact table. If you work with daily partitions, it is simple to use a surrogate key as a partition key. But to create a fact table with monthly partitions, the definition of the partition borders is hard to define and even harder to understand: 

CREATE TABLE fct_sales
(date_id NUMBER
,prod_id NUMBER
,cust_id NUMBER
,amount NUMBER
)
PARTITION BY RANGE (date_id)
(PARTITION p_2014_01 VALUES LESS THAN (32)
,PARTITION p_2014_02 VALUES LESS THAN (60)
,PARTITION p_2014_03 VALUES LESS THAN (91)
...
,PARTITION p_2014_11 VALUES LESS THAN (305)
,PARTITION p_2014_12 VALUES LESS THAN (335)
,PARTITION p_2015_01 VALUES LESS THAN (366)
,PARTITION p_2015_02 VALUES LESS THAN (397)
...
)

This example only works if the surrogate key is sorted in the same order as the dates – otherwise it is not possible at all to create monthly partitions. Strictly speaking, this is not a surrogate key anymore because of the sort order restriction. But even if we are more tolerant and define the surrogate key ordered by calendar day, this kind of dimension key is not practical for partitioned fact tables.

Numeric Date Key

If we replace the surrogate keys with numeric date keys in the form year/month/date (date format ‘YYYYMMDD’), we have a sorted list of numbers that is much easier to use for partitioning, as shown in the following example:

CREATE TABLE fct_sales
(date_id NUMBER
,prod_id NUMBER
,cust_id NUMBER
,amount NUMBER
)
PARTITION BY RANGE (date_id)
(PARTITION p_2014_01 VALUES LESS THAN (20140201)
,PARTITION p_2014_02 VALUES LESS THAN (20140301)
,PARTITION p_2014_03 VALUES LESS THAN (20140401)
...
,PARTITION p_2014_11 VALUES LESS THAN (20141201)
,PARTITION p_2014_12 VALUES LESS THAN (20150101)
,PARTITION p_2015_01 VALUES LESS THAN (20150201)
,PARTITION p_2015_02 VALUES LESS THAN (20150301)
...
)

Instead of a surrogate key we have a natural key of data type NUMBER. This solution works fine for partitioning – even for monthly partitions. Because of the date format ‘YYYYMMDD’, the keys are in the correct order to use RANGE partitioning. But we have another issue: The query optimizer is not able to estimate the correct cardinality for queries with a filter on the numeric date column (see Christian Antognini’s paper “Datatypes: Worst Practices”). The following query on the time dimension returns 31 rows for March 2015, but the optimizer estimates only 3 rows.

SELECT * FROM dim_date
WHERE date_id BETWEEN 20150301 AND 20150331
-------------------------------------------------------------------
| Id | Operation | Name | Rows |
-------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 3 |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| DIM_DATE | 3 |
|* 2 | INDEX RANGE SCAN | DIM_DATE_PK | 3 |
-------------------------------------------------------------------

In a well designed star schema, this should not be a challenge, as long as the key column is used only for joins between time dimension and fact table. Filters are applied on descriptive columns on the time dimension. In our example, the time dimension contains an additional column CALENDAR_DAY of type DATE. If this column is used as a filter condition, the estimation will be correct.

 SELECT * FROM dim_date
WHERE calendar_day BETWEEN TO_DATE('2015-03-01', 'YYYY-MM-DD')
AND TO_DATE('2015-03-31', 'YYYY-MM-DD');

An even better way to select all days of March 2015 is a filter on the month description, which is usually available in a complete time dimension:

SELECT * FROM dim_date
WHERE month_desc = 'March 2015';

In other words: Although a numeric date column should be avoided, it causes no wrong estimations in a star schema. This is the case because the date key is used exclusively as join column, not as a filter condition. With the format year/month/day, the numeric value can be used for RANGE partitioning, but not for INTERVAL partitioning, as we will see later.

Date Key

The typical approach in many Oracle data warehouses is to use a DATE column as primary key for the time dimension. In this case, the foreign key on the fact table is of datatype DATE, too. It can easily be used as the partition key for a RANGE partitioned fact table. The following statement shows how a fact table with monthly partitions is usually created:

CREATE TABLE fct_sales
(date_id DATE
,prod_id NUMBER
,cust_id NUMBER
,amount NUMBER
)
PARTITION BY RANGE (date_id)
(PARTITION p_2014_01 VALUES LESS THAN (TO_DATE('2014-02-01', 'YYYY-MM-DD'))
,PARTITION p_2014_02 VALUES LESS THAN (TO_DATE('2014-03-01', 'YYYY-MM-DD'))
,PARTITION p_2014_03 VALUES LESS THAN (TO_DATE('2014-04-01', 'YYYY-MM-DD'))
...
,PARTITION p_2014_11 VALUES LESS THAN (TO_DATE('2014-12-01', 'YYYY-MM-DD'))
,PARTITION p_2014_12 VALUES LESS THAN (TO_DATE('2015-01-01', 'YYYY-MM-DD'))
,PARTITION p_2015_01 VALUES LESS THAN (TO_DATE('2015-02-01', 'YYYY-MM-DD'))
,PARTITION p_2015_02 VALUES LESS THAN (TO_DATE('2015-03-01', 'YYYY-MM-DD'))
...
)

With a DATE key, the creation of monthly partitions can be simplified with INTERVAL partitioning. Because it automates the creation of new partitions, it is very convenient to use in DWH environments with rolling history. I think I never used the “old fashioned” RANGE partitioning anymore since this feature is available in Oracle 11g.

CREATE TABLE fct_sales
(date_id DATE
,prod_id NUMBER
,cust_id NUMBER
,amount NUMBER
)
PARTITION BY RANGE (date_id)
INTERVAL(numtoyminterval(1, 'MONTH'))
(PARTITION p_old_data VALUES LESS THAN (TO_DATE('2014-01-01', 'YYYY-MM-DD')))

Another advantage of a DATE key is that the query optimizer is able to calculate correct estimations. We have seen above that a filter condition on the primary key of a dimension table should not be used in a star schema. But anyway, the optimizer is able to handle it correctly:

SELECT * FROM dim_date
WHERE date_id BETWEEN TO_DATE('2015-03-01', 'YYYY-MM-DD')
AND TO_DATE('2015-03-31', 'YYYY-MM-DD')
-------------------------------------------------------------------
| Id | Operation | Name | Rows |
-------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 31 |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| DIM_DATE | 31 |
|* 2 | INDEX RANGE SCAN | DIM_DATE_PK | 31 |
-------------------------------------------------------------------

So, what could be a reason not to use a DATE column for the primary key of the time dimension? The only argument I heard from customers using numeric date keys: The size of the fact table will increase. Yes, that’s true: The internal representation of a DATE column in Oracle is 7 bytes. If you use a NUMBER column in the format YYYYMMDD, only 5 bytes are used for each value. 2 bytes less for each row. For a fact table with 100 million rows, this would be 200 MB less data. Does this really matter? I think, no.

Julian Date Key

Another interesting alternative for a surrogate key is to use the julian day as a numeric primary key. This solution is described in Dylan Wan’s blog. The Julian day is an increasing number for each date, starting on 1st Januar 4712 B.C. The Julian day of today can be selected in Oracle with the following query:

SELECT TO_CHAR(SYSDATE, 'J') FROM dual

The Julian day can be stored as a NUMBER column and used as the primary key for the time dimension. Because it is an ordered key that identifies a particular date, it can be used as a partition key for RANGE partitioned fact tables, too. The definition of the partition borders is a bit tricky. One way is to define a virtual column of type DATE and use this column as the partition key (partitioning on virtual columns is supported since Oracle 11g):

CREATE TABLE fct_sales
(date_id NUMBER
,prod_id NUMBER
,cust_id NUMBER
,amount NUMBER
,cal_day DATE AS (TO_DATE(date_id, 'J')) VIRTUAL
)
PARTITION BY RANGE (cal_day)
(PARTITION p_2014_01 VALUES LESS THAN (TO_DATE('2014-02-01', 'YYYY-MM-DD'))
,PARTITION p_2014_02 VALUES LESS THAN (TO_DATE('2014-03-01', 'YYYY-MM-DD'))
,PARTITION p_2014_03 VALUES LESS THAN (TO_DATE('2014-04-01', 'YYYY-MM-DD'))
...
,PARTITION p_2014_11 VALUES LESS THAN (TO_DATE('2014-12-01', 'YYYY-MM-DD'))
,PARTITION p_2014_12 VALUES LESS THAN (TO_DATE('2015-01-01', 'YYYY-MM-DD'))
,PARTITION p_2015_01 VALUES LESS THAN (TO_DATE('2015-02-01', 'YYYY-MM-DD'))
,PARTITION p_2015_02 VALUES LESS THAN (TO_DATE('2015-03-01', 'YYYY-MM-DD'))
...
)

What is the benefit of this solution? The required space per value is the same as with a numeric date key (5 bytes), but because there are not gaps in the number ranges, it is easier for the optimizer to estimate the cardinality. As mentioned before, this is not really relevant for typical queries in a star schema. Therefore, to use the Julian day as the primary key for the time dimension is a nice idea, but it has no real advantages.

Summary

So, what is the best solution for the primary key of the time dimension? We have seen four different approaches. Each of them has some benefits and drawbacks.

  • Surrogate keys are the best practice approach for all dimensions, but not useful for the time dimension in combination with partitioned fact tables.
  • Numeric date keys are feasible for partitioning. They can result in wrong optimizer estimations, but for typical queries in a star schema, this is not the case.
  • Date keys are easy to use in combination with RANGE or INTERVAL partitioning, but need a little more space than numeric keys.
  • Julian date keys are similar to numeric date keys, except for the cardinality estimation of the optimizer.

What is my personal opinion? I will still use a DATE primary key for the time dimension and surrogate keys for all other dimensions. But even with the date key, I avoid to use the key directly as a filter condition in queries, but use it only for joins between fact table and time dimension.

Advertisements

2 thoughts on “Time Dimension Keys and Partitioning

  1. Hi Dani,

    I have been struggling with this very issue for a few weeks and I was sure this post would fix my problem. However, after changing all my date keys to use dates and rebuilding my fact tables with date-based partitions, I am still seeing t-9he strangest behavior.

    When I run a simple query filtered on 5 years or less the execution plan looks good and the query run fast. When I go to 6 years, the execution plan changes drastically and I lose parallelism. Do you have any idea why this happens?

    Good Plan:

    EXPLAIN PLAN FOR
    SELECT
    D.”CAL_YEAR” AS “Incur_Year”,
    sum(f.”PAID”) as “Paid”
    FROM F_ACTIVE_MED_CLAIMS F
    inner join “D_DATE” d
    on f.”INCUR_DTE_S_KEY” = d.”CAL_DATE”
    WHERE D.”CAL_YEAR” BETWEEN 2011 AND 2015
    group by D.”CAL_YEAR” ;

    SELECT PLAN_TABLE_OUTPUT FROM TABLE(DBMS_XPLAN.DISPLAY());

    Plan hash value: 3774419026

    —————————————————————————————————————————————————
    | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop | TQ |IN-OUT| PQ Distrib |
    —————————————————————————————————————————————————
    | 0 | SELECT STATEMENT | | 5 | 120 | 84156 (2)| 00:16:50 | | | | | |
    | 1 | PX COORDINATOR | | | | | | | | | | |
    | 2 | PX SEND QC (RANDOM) | :TQ10002 | 5 | 120 | 84156 (2)| 00:16:50 | | | Q1,02 | P->S | QC (RAND) |
    | 3 | HASH GROUP BY | | 5 | 120 | 84156 (2)| 00:16:50 | | | Q1,02 | PCWP | |
    | 4 | PX RECEIVE | | 5 | 120 | 84156 (2)| 00:16:50 | | | Q1,02 | PCWP | |
    | 5 | PX SEND HASH | :TQ10001 | 5 | 120 | 84156 (2)| 00:16:50 | | | Q1,01 | P->P | HASH |
    | 6 | HASH GROUP BY | | 5 | 120 | 84156 (2)| 00:16:50 | | | Q1,01 | PCWP | |
    |* 7 | HASH JOIN | | 69M| 1586M| 83703 (1)| 00:16:45 | | | Q1,01 | PCWP | |
    | 8 | PART JOIN FILTER CREATE| :BF0000 | 1826 | 21912 | 26 (0)| 00:00:01 | | | Q1,01 | PCWP | |
    | 9 | PX RECEIVE | | 1826 | 21912 | 26 (0)| 00:00:01 | | | Q1,01 | PCWP | |
    | 10 | PX SEND BROADCAST | :TQ10000 | 1826 | 21912 | 26 (0)| 00:00:01 | | | Q1,00 | P->P | BROADCAST |
    | 11 | PX BLOCK ITERATOR | | 1826 | 21912 | 26 (0)| 00:00:01 | 1 | 15 | Q1,00 | PCWC | |
    |* 12 | TABLE ACCESS FULL | D_DATE | 1826 | 21912 | 26 (0)| 00:00:01 | 1 | 15 | Q1,00 | PCWP | |
    | 13 | PX BLOCK ITERATOR | | 80M| 924M| 83629 (1)| 00:16:44 |:BF0000|:BF0000| Q1,01 | PCWC | |
    | 14 | TABLE ACCESS FULL | F_ACTIVE_MED_CLAIMS | 80M| 924M| 83629 (1)| 00:16:44 |:BF0000|:BF0000| Q1,01 | PCWP | |
    —————————————————————————————————————————————————

    Predicate Information (identified by operation id):
    —————————————————

    7 – access(“F”.”INCUR_DTE_S_KEY”=”D”.”CAL_DATE”)
    12 – filter(“D”.”CAL_YEAR”=2011)

    Bad Plan:

    EXPLAIN PLAN FOR
    SELECT
    D.”CAL_YEAR” AS “Incur_Year”,
    sum(f.”PAID”) as “Paid”
    FROM F_ACTIVE_MED_CLAIMS F
    inner join “D_DATE” d
    on f.”INCUR_DTE_S_KEY” = d.”CAL_DATE”
    WHERE D.”CAL_YEAR” BETWEEN 2010 AND 2015
    group by D.”CAL_YEAR” ;

    SELECT PLAN_TABLE_OUTPUT FROM TABLE(DBMS_XPLAN.DISPLAY());

    Plan hash value: 4263234925

    —————————————————————————————————————————————————-
    | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop | TQ |IN-OUT| PQ Distrib |
    —————————————————————————————————————————————————-
    | 0 | SELECT STATEMENT | | 3 | 75 | 5764 (1)| 00:01:10 | | | | | |
    | 1 | PX COORDINATOR | | | | | | | | | | |
    | 2 | PX SEND QC (RANDOM) | :TQ20001 | 3 | 75 | 5764 (1)| 00:01:10 | | | Q2,01 | P->S | QC (RAND) |
    | 3 | HASH GROUP BY | | 3 | 75 | 5764 (1)| 00:01:10 | | | Q2,01 | PCWP | |
    | 4 | PX RECEIVE | | 3 | 75 | 5764 (1)| 00:01:10 | | | Q2,01 | PCWP | |
    | 5 | PX SEND HASH | :TQ20000 | 3 | 75 | 5764 (1)| 00:01:10 | | | Q2,00 | P->P | HASH |
    | 6 | HASH GROUP BY | | 3 | 75 | 5764 (1)| 00:01:10 | | | Q2,00 | PCWP | |
    | 7 | NESTED LOOPS | | 170 | 4250 | 5763 (1)| 00:01:10 | | | Q2,00 | PCWP | |
    | 8 | PX BLOCK ITERATOR | | 2191 | 26292 | 26 (0)| 00:00:01 | 1 | 15 | Q2,00 | PCWC | |
    |* 9 | TABLE ACCESS FULL | D_DATE | 2191 | 26292 | 26 (0)| 00:00:01 | 1 | 15 | Q2,00 | PCWP | |
    | 10 | VIEW PUSHED PREDICATE | VW_GBC_5 | 1 | 13 | | | | | Q2,00 | PCWP | |
    |* 11 | FILTER | | | | | | | | Q2,00 | PCWP | |
    | 12 | SORT AGGREGATE | | 1 | 177 | | | | | Q2,00 | PCWP | |
    | 13 | PX COORDINATOR | | | | | | | | | | |
    | 14 | PX SEND QC (RANDOM) | :TQ10000 | 1 | 177 | | | | | Q1,00 | P->S | QC (RAND) |
    | 15 | SORT AGGREGATE | | 1 | 177 | | | | | Q1,00 | PCWP | |
    | 16 | PX BLOCK ITERATOR | | 37958 | 6561K| 5554 (1)| 00:01:07 | KEY | KEY | Q1,00 | PCWC | |
    |* 17 | TABLE ACCESS FULL| F_ACTIVE_MED_CLAIMS | 37958 | 6561K| 5554 (1)| 00:01:07 | KEY | KEY | Q1,00 | PCWP | |
    —————————————————————————————————————————————————-

    Predicate Information (identified by operation id):
    —————————————————

    9 – filter(“D”.”CAL_YEAR”=2010)
    11 – filter(COUNT(SYS_OP_CSR(SYS_OP_MSR(COUNT(*),SUM(“F”.”PAID”)),0))>0 AND 2010<=2015)
    17 – filter("F"."INCUR_DTE_S_KEY"="D"."CAL_DATE")

    Like

  2. Hi Gerard

    As I already wrote you in an e-mail, possible reasons for the wrong selectivity and therefore the bad execution plan are
    – Missing or outdated statistics
    – Missing foreign key between fact table and time dimension
    – Missing bitmap index on INCUR_DTE_S_KEY (and on all other dimension keys)

    In your case, the statistics were the root of the problem. Now it works – thank you for your feedback.

    Cheers, Dani.

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s