Reference Partitioning on Star Schema: A Good Idea?

In my last blog post, I noticed that surrogate keys on a time dimension cannot be combined with partitioned fact tables in a star schema. Actually, there is a way to do that: With reference partitioning it is possible to implement monthly partitioned fact tables based on a surrogate time key. But is this a good idea?

A few hours after I published the blog post Time Dimension Keys and Partitioning, my former colleague Franck Pachot asked me on Twitter: “What do you think about reference partitioning for fans of surrogate key? Partition the DIM and FACT is partitioned.” I never had this idea before, but I found this approach interesting enough to  implement a little test case. The conclusion: It works, but it is not very useful for star schemas. Why? Let’s go into details.

Interval Reference Partitioning

Reference partitioning allows to use the same partition key in a master and a detail table, even if the partition key is not included in the detail table. Since Oracle 12c it can be combined with interval partitioning on the master table, as we will see in the following example.

In a star schema, the master table is a dimension table, the detail table is the fact table. Because we want to partition the fact table by date, we use the time dimension as the master table. The first step is to create a time dimension with monthly partitions. The primary key of this table is a surrogate key. Interval partitioning can be used to create monthly partitions. The partition key is the actual date column – in our example CALENDAR_DAY. If we fill the time dimension with calendar dates for the next 10 years, 120 monthly partitions are created on table DIM_DATE automatically.

CREATE TABLE dim_date
(DATE_ID NUMBER NOT NULL -- surrogate key
,CALENDAR_DAY DATE NOT NULL -- partition key
,DAY_OF_WEEK VARCHAR2(10) NOT NULL
,DAY_OF_MONTH NUMBER(2) NOT NULL
,WEEK_OF_YEAR NUMBER(2) NOT NULL
,MONTH_NUMBER NUMBER(2) NOT NULL
,MONTH_DESC VARCHAR2(15) NOT NULL
,QUARTER_NUMBER NUMBER(1) NOT NULL
,QUARTER_DESC VARCHAR2(8) NOT NULL
,YEAR_NUMBER NUMBER(4) NOT NULL
)
PARTITION BY RANGE (calendar_day)
INTERVAL(numtoyminterval(1, 'MONTH'))

(PARTITION p_old_data VALUES LESS THAN (TO_DATE('2014-01-01', 'YYYY-MM-DD')));
ALTER TABLE dim_date
ADD CONSTRAINT dim_date_pk PRIMARY KEY (date_id);

The column CALENDAR_DAY is not available in the fact table, but reference partitioning allows to create monthly partitions on this table, too. The partition key is a foreign key constraint, in our case the foreign key to the time dimension. This will create a separate partition on the fact table for each monthly partition of the time dimension. With the combination of interval partitioning and reference partitioning, only the required partitions are created on the fact table. Even when we have 120 partitions on table DIM_DATE, the fact table FCT_SALES contains only those partitions that are currently filled. Whenever facts are loaded for a DATE_ID that refers to a new month, an additional partition is created for the fact table.

CREATE TABLE fct_sales
(date_id NUMBER NOT NULL
,prod_id NUMBER NOT NULL
,cust_id NUMBER NOT NULL
,amount NUMBER NOT NULL
,CONSTRAINT fk_sales_date FOREIGN KEY(date_id) REFERENCES dim_date
)
PARTITION BY REFERENCE (fk_sales_date);
CREATE BITMAP INDEX bm_sales_date_id
ON fct_sales (date_id) LOCAL;

In Oracle 11g, reference partitioning is already available, but not interval reference partitioning. In this case, we would create the time dimension table with “classical” RANGE partitioning. Reference partitioning can be used for the fact table, but it would create all partitions in advance (i.e. 120 partitions if the time dimension contains the calendar for 10 years).

Star Schema Queries

So, we are able to create a monthly partitioned fact table, even when we use a surrogate key for the primary key of the time dimension. Let’s do some queries now on the star schema. Typically, a star schema query joins the fact table with one or more dimension tables and filters the rows on dimension columns. The following example retrieves the total amount per day for March 2015. Because we want to avoid date arithmetic in the queries, the time dimension contains convenient columns for each aggregation level. To select all rows for one month, we can use a filter on column MONTH_DESC:

SELECT calendar_day, SUM(amount)
FROM fct_sales f
JOIN dim_date d ON (d.date_id = f.date_id)
WHERE d.month_desc = 'March 2015'
GROUP BY calendar_day;

Because we have monthly partitions on the fact table, the query optimizer should be able to use partition pruning. For March 2015, only one partition has to be accessed. Unfortunately, the reality looks different, as we can see in the execution plan:

----------------------------------------------------------------
| Id | Operation | Name | ... | Pstart| Pstop |
----------------------------------------------------------------
| 0 | SELECT STATEMENT | | ... | | |
| 1 | PARTITION RANGE ALL | | ... | 1 |1048575|
| 2 | HASH GROUP BY | | ... | | |
| 3 | NESTED LOOPS | | ... | | |
|* 4 | TABLE ACCESS FULL| DIM_DATE | ... | 1 |1048575|
|* 5 | TABLE ACCESS FULL| FCT_SALES | ... | 1 |1048575|
----------------------------------------------------------------

A join of all partitions of the fact table and the time dimension is executed. Why? Because we have no restriction on the partition key CALENDAR_DAY. Therefore all partitions of the time dimension are read and joined with all partitions of the fact table. This is not what we expect for good performance in a star schema.

Of course, we can rewrite the query with a filter on CALENDAR_DAY. Because this is the partition key of the dimension table, partition pruning works now. Only one partition of DIM_DATE is accessed and joined with the corresponding partition of the fact table. The performance will be good, but this is not the way we want to write our queries in a star schema:

SELECT calendar_day, SUM(amount)
FROM fct_sales f
JOIN dim_date d ON (d.date_id = f.date_id)
WHERE d.calendar_day BETWEEN TO_DATE('2015-03-01', 'YYYY-MM-DD')
AND TO_DATE('2015-03-31', 'YYYY-MM-DD')

GROUP BY calendar_day;
--------------------------------------------------------------------------------------
| Id | Operation | Name | ... | Pstart| Pstop |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | ... | | |
| 1 | HASH GROUP BY | | ... | | |
| 2 | NESTED LOOPS | | ... | | |
| 3 | NESTED LOOPS | | ... | | |
| 4 | PARTITION RANGE SINGLE | | ... | 16 | 16 |
|* 5 | TABLE ACCESS FULL | DIM_DATE | ... | 16 | 16 |
| 6 | PARTITION REFERENCE ITERATOR | | ... | KEY | |
| 7 | BITMAP CONVERSION TO ROWIDS | | ... | | |
|* 8 | BITMAP INDEX SINGLE VALUE | BM_SALES_DATE_ID | ... | KEY | |
| 9 | TABLE ACCESS BY LOCAL INDEX ROWID| FCT_SALES | ... | 1 | 1 |
--------------------------------------------------------------------------------------

It works, but…

We have seen in this test case that reference partitioning (or interval reference partitioning) allows to create partitioned fact tables, even when a surrogate key is used for the time dimension. But we are not able anymore to run queries on the star schema with filters on any columns of the time dimension except on the partition key. Therefore I would not recommend to use this approach in a real data warehouse. As already explained before, I will still use a DATE primary key for the time dimension, but surrogate keys for all other dimensions.

Thank you, Franck, for the idea of this test case.

Advertisements

One thought on “Reference Partitioning on Star Schema: A Good Idea?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s