Foreign Key Constraints in an Oracle Data Warehouse

Foreign key constraints are often omitted in Data Warehouses. This blog post explains why this is not a good approach and what is a practical compromise for foreign key constraints in an Oracle Data Warehouse.

In one of the sessions at the DOAG conference 2015, there was a long discussion about foreign key constraints. The speaker recommended to avoid constraints because of dependencies in the load order of the DWH tables. Some of the attendees contradicted that foreign keys are important for documentation and data consistency in the Data Warehouse. From my point of view, both arguments were right, but how can we deal with this oppositional requirements? Let’s investigate some more details to answer this question.

Why to Create Foreign Key Constraints?

There are several reasons why foreign key constraints should always be created in a relational database (Data Warehouses and OLTP systems):

  • A foreign key constraint proves the data consistency in the tables. For a master-detail relationship, a detail record can only be inserted or updated if the referenced key is available in the master table. And a record in the master table can only be deleted if no corresponding records are stored in the detail table (unless we create the constraint with the option ON DELETE CASCADE). In an OLTP system, this is evident, but do we have to prove these conditions in a Data Warehouse as well?
  • Foreign key constraints are used to show the relationships between tables in the database. This is very useful if you run queries on the data dictionary views to get information about the dependencies of the tables. It is also important for the documentation of a data model. If you do a reverse engineering of your database into a modeling tool to create an entity-relationship diagram of your system, foreign keys are a precondition for this approach. Without the constraints, you will have a diagram with a lot of stand-alone tables, but no relationships.
  • Finally, the query optimizer uses constraints in the database to find the best execution plan. They can help to estimate more accurate values for the cardinality. This is not only the case for foreign key constraints. For example, a NOT NULL constraint is used for estimations, too. Foreign key constraints are needed for Query Rewrite on materialized views as well as for some of the transformations of the Oracle optimizer. To explain this behavior, we look at the following query on a Star Schema:
SELECT p.prod_cat_desc,
       SUM(s.amount_sold)
  FROM sales s
  JOIN products p  ON (s.prod_id = p.prod_id)
  JOIN customers c ON (s.cust_id = c.cust_id)
  JOIN times t     ON (s.time_id = t.time_id)
 WHERE t.calendar_month_desc = '2014-03'
GROUP BY p.prod_cat_desc;
 
---------------------------------------------------------------
| Id  | Operation                            | Name           |
---------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                |
|   1 |  HASH GROUP BY                       |                |
|*  2 |   HASH JOIN                          |                |
|   3 |    TABLE ACCESS FULL                 | PRODUCTS       |
|   4 |    NESTED LOOPS                      |                |
|   5 |     NESTED LOOPS                     |                |
|*  6 |      TABLE ACCESS FULL               | TIMES          |
|   7 |      PARTITION RANGE ITERATOR        |                |
|   8 |       BITMAP CONVERSION TO ROWIDS    |                |
|*  9 |        BITMAP INDEX SINGLE VALUE     | SALES_TIME_BIX |
|  10 |     TABLE ACCESS BY LOCAL INDEX ROWID| SALES          |
---------------------------------------------------------------

The query joins the fact table SALES with the three dimension tables PRODUCTS, CUSTOMERS and TIMES. Because no information of the CUSTOMER dimension is required to select the result, the query optimizer is able to eliminate this table from the execution plan. This kind of transformation is called Join Elimination and is very useful for queries generated by BI tools on a Star Schema, but can also be helpful for ETL queries on Core tables (see blog post The Advantage of Silly Mistakes).

When no foreign key constraints are defined on the Star Schema, Join Elimination does not work as expected, and the CUSTOMERS table is joined although no information is used in the query result. This example shows that constraints are not only important for data consistency and clarity of the data model, but can also be used by the optimizer for some query transformations.

Why to Avoid Foreign Key Constraints?

There are reasons against foreign key constraints in a Data Warehouse. The following arguments are often mentioned:

  • Additional time during the data loads is required to check the validity of the constraints. But does this make sense? When we load a high number of facts into a fact table and do a key lookup for each dimension key, why should we check again these keys before they are inserted into the fact table? The ETL job already guarantees that only valid dimension keys are stored in the fact table. So, there is no need to prove the data again.
  • With foreign key constraints, the tables must be loaded in a predefined order. The master table (e.g. a dimension) must be loaded before the detail table (e.g. a fact table) to guarantee that the referenced keys are available. In some cases, it is desirable to load the tables in parallel. This is the case if the keys can be loaded from a previous DWH layer (e.g. surrogate keys of the Core tables are reused in the Data Mart tables).
  • If hash keys are used instead of sequence numbers, the master and detail tables can be loaded independently from each other. For example, in a Data Vault 2.0 schema, Hubs and Satellites are loaded at the same time. But with foreign key constraints, it is usually not possible to load a Satellite table when the corresponding keys are not yet stored in the referenced Hub table.

How can we handle these oppositional requirements? We want to profit by the advantages of constraints, but without slowing down the ETL run time or loosing the possibility of an independent load order.

The Oracle Way: Reliable Constraints

Oracle allows to create reliable constraints. If a constraint is created with the option RELY, it can be used by the optimizer even if it is disabled or not validated. This is practical to load master and detail tables independently, and to avoid additional checks during the ETL jobs. Because the constraints are still defined in the data dictionary, they are available for queries to retrieve relationships as well as for reverse-engineering of the data model into a modeling tool.

When a foreign key constraint is defined as RELY, the referenced primary key must be reliable, too. To define a reliable foreign key between fact table SALES and dimension table PRODUCTS, we can use the following statements:

ALTER TABLE products
ADD CONSTRAINT products_pk
PRIMARY KEY (prod_id) RELY;
 
ALTER TABLE sales
ADD CONSTRAINT sales_prod_fk
FOREIGN KEY (prod_id)
REFERENCES products (prod_id)
RELY ENABLE NOVALIDATE;

In this example, the foreign key is still enabled, and all new rows inserted into table SALES would be checked for a valid PROD_ID. To avoid this check, the foreign key constraints are usually disabled before the load. When a constraint is disabled, the data consistency is not checked anymore. Therefore, it is possible to load rows into the fact table SALES even when the referenced dimension keys are not yet available at this time.

ALTER TABLE sales
MODIFY CONSTRAINT sales_prod_fk
DISABLE;

After the ETL jobs are finished, the constraints can be enabled. But enabling a constraint usually means that all data in the table is proved again – even the rows that were loaded in previous loads. This can cause a lot of time and resources. Therefore, a pragmatic way is to enable the constraint with the NOVALIDATE option. In this case, the constraint is enabled, but for the existing data no data consistency is validated.

ALTER TABLE sales
MODIFY CONSTRAINT sales_prod_fk
ENABLE NOVALIDATE;

Because the constraint is still defined as RELY, it is reliable for the optimizer and can be used for cardinality estimations and query transformations. For Query Rewrite on materialized views, the following initialization parameters must be set to allow the optimizer to work with reliable constraints:

ALTER SESSION SET query_rewrite_enabled = TRUE;
ALTER SESSION SET query_rewrite_integrity = TRUSTED;

The parameter query_rewrite_enabled is TRUE by default and allows to use Query Rewrite if a suitable materialized view is available. The parameter query_rewrite_integrity is set to ENFORCED by default. With this value, Query Rewrite works only on enabled and validated constraints. By setting the parameter to TRUSTED, Oracle is allowed use Query Rewrite even when the constraints are defined as ENABLE NOVALIDATE or even DISABLE. Since Oracle 12c, this setting is also required for Join Elimination (see blog post Join Elimination: Difference in Oracle 11g and 12c).

Actually, there is no difference for the optimizer whether a reliable constraint is enabled or disabled. Therefore, the is no need to enable the foreign key constraints after each ETL jobs. Rather, the constraints can be declared as RELY DISABLE NOVALIDATE all the time.

Recommendation for Data Warehouses

When the relationships between tables in a Data Warehouse can be guaranteed by the ETL jobs, we don’t have to implement foreign key constraints for data consistency. In this case, it is recommended to define reliable foreign key constraints. This is much better than to create no constraints at all. Reliable constraints are useful for documentation of the relationships as well as for query transformations. When the ETL jobs do not guarantee data consistency, constraints should be enabled and validated.

1 thought on “Foreign Key Constraints in an Oracle Data Warehouse

  1. RELY DWH NOVALIDATE is my DWH constraint mantra – it helps developers and tools _A_LOT_. I hate switching focus to model or some kind of data cleansing tool all the time, to look up for data constraints of any kind.

    Like

Leave a comment