Data Vault Queries and Join Elimination

If you work with Data Vault for a data warehouse running in an Oracle database, I strongly recommend to use Oracle 12.2 or higher. Why that? Since Oracle 12c Release 2, join elimination works for more than one join column. This is essential for queries on a Data Vault schema.

I often have discussions with customers or training attendees that have concerns about query performance on Data Vault schemas. Data Vault is a suitable data modeling method for integration and historization of data from different source systems in a data warehouse. But because a Data Vault schema typically contains a high number of tables, a lot of joins are required to select data from all the Hubs, Links and Satellites that are involved in each query. A little-known performance feature introduced with Oracle 12.2 helps to improve the query performance.

In a DOAG presentation this year, we have shown several possibilities how data can be loaded from a Data Vault schema into a Star Schema. In one live demo, I explained the purpose of join elimination in this context. Since Oracle 12.2, multi-column join elimination is supported, as Jonathan Lewis wrote 3 years ago in his blog post Join Elimination 12.2. The benefit of this feature for Data Vault I want to explain in this blog post in detail.

In Data Vault, we typically have multiple Satellite tables attached to a Hub table. A separate Satellite may be created for each source system, for different change frequencies, for additional columns due to new requirements and so on. So, it is common to have many Satellites for one Hub. This has advantages for independent load jobs, data model enhancements, etc., but makes it more difficult to extract data from the Data Vault schema. To reduce the complexity of the extraction jobs or ad-hoc queries, a proven approach is to create (or better: generate) a view layer on top of the Data Vault model. For example, we can create two views for each Hub and its corresponding Satellites:

  • The Current View returns the current version of all Satellites for each Hub key
  • The Version View (or History View) returns the whole history of the data, with a separate version for each validity period

For good query performance, a common approach in Data Vault is to create a “Point in Time table” (PIT table) for each Hub. I already explained in blog post Loading Dimensions from a Data Vault Model how the PIT table and the view layer can be implemented. A similar approach I used for my demo example at the DOAG presentation.

Let’s assume we have a Hub H_CUSTOMER with three Satellite tables S_CUSTOMER_INFO, S_CUSTOMER_ADDRESS and S_BILLING_ADDRESS. The PIT table PIT_CUSTOMER contains the load date for each Satellite version that is valid in a particular validity range. Details are explained here. On top of these tables, we create a Current View and a Version View.

PIT Customer

The two views look almost the same, the only difference is that the Current View contains a filter WHERE pit.load_end_date IS NULL to return only the current versions. The Version View does not apply this filter, but on the other hand contains two additional columns VALID_FROM and VALID_TO to describe the validity range. To show the behaviour of join elimination, I will use the Current View, but it works exactly the same with the Version View.

The Current View for our example contains all attributes of all Satelllites and is created like this:

 

CREATE OR REPLACE VIEW v_curr_customer AS

SELECT hub.h_customer_key

     , hub.customer_no

     , s1.preferred_contact

     , s1.e_mail_address

     , s1.phone_number

     , s1.private_person

     , s1.reseller

     , s1.delivery_type

     , s2.last_name cust_last_name

     , s2.first_name cust_first_name

     , s2.street cust_street

     , s2.street_no cust_street_no

     , s2.zip_code cust_zip_code

     , s2.city cust_city

     , s3.last_name bill_last_name

     , s3.first_name bill_first_name

     , s3.street bill_street

     , s3.street_no bill_street_no

     , s3.zip_code bill_zip_code

     , s3.city bill_city

  FROM h_customer hub

  JOIN pit_customer pit

    ON (hub.h_customer_key = pit.h_customer_key)

  LEFT JOIN s_customer_info s1

    ON (s1.h_customer_key = pit.h_customer_key AND s1.load_date = pit.s1_load_date)

  LEFT JOIN s_customer_address s2

    ON (s2.h_customer_key = pit.h_customer_key AND s2.load_date = pit.s2_load_date)

  LEFT JOIN s_billing_address s3

    ON (s3.h_customer_key = pit.h_customer_key AND s3.load_date = pit.s3_load_date)

 WHERE pit.load_end_date IS NULL;

 

As you can see, the joins between PIT table and each Satellite have two join columns: the hub key and the load date for the Satellite. Left outer joins are used because it is not guaranteed that data is available in each Satellite.

When we look at the execution plan of a query that selects columns from the Hub and each of the Satellites, all tables that are involved in the query have to be accessed. Of course!

 

EXPLAIN PLAN FOR

SELECT h_customer_key

     , customer_no     — H_CUSTOMER

     , e_mail_address  — S_CUSTOMER_INFO

     , cust_first_name — S_CUSTOMER_ADDRESS

     , bill_first_name — S_BILLING_ADDRESS

  FROM v_curr_customer;

 

SELECT * FROM dbms_xplan.display();

 

——————————————————————–

| Id  | Operation                     | Name               | Rows  |

——————————————————————–

|   0 | SELECT STATEMENT              |                    | 20000 |

|*  1 |  HASH JOIN RIGHT OUTER        |                    | 20000 |

|   2 |   TABLE ACCESS STORAGE FULL   | S_CUSTOMER_INFO    | 27000 |

|*  3 |   HASH JOIN RIGHT OUTER       |                    | 20000 |

|   4 |    TABLE ACCESS STORAGE FULL  | S_BILLING_ADDRESS  |  4964 |

|*  5 |    HASH JOIN OUTER            |                    | 20000 |

|*  6 |     HASH JOIN                 |                    | 20000 |

|   7 |      TABLE ACCESS STORAGE FULL| H_CUSTOMER         | 20000 |

|*  8 |      TABLE ACCESS STORAGE FULL| PIT_CUSTOMER       | 20000 |

|   9 |     TABLE ACCESS STORAGE FULL | S_CUSTOMER_ADDRESS | 33406 |

——————————————————————–

 

But if we comment out some of the columns or write a query that reads only columns from some of the Satellites, join elimination comes into play: Only the relevant tables are joined with the PIT table. If no data from the Hub or some of the Satellites are required, these tables are not accessed in the query. This reduces the number of joins and improves the performance of the query.

 

EXPLAIN PLAN FOR

SELECT h_customer_key

   –, customer_no     — H_CUSTOMER

     , e_mail_address  — S_CUSTOMER_INFO

     , cust_first_name — S_CUSTOMER_ADDRESS

   –, bill_first_name — S_BILLING_ADDRESS

  FROM v_curr_customer;

 

SELECT * FROM dbms_xplan.display();

 

——————————————————————

| Id  | Operation                   | Name               | Rows  |

——————————————————————

|   0 | SELECT STATEMENT            |                    | 20000 |

|*  1 |  HASH JOIN RIGHT OUTER      |                    | 20000 |

|   2 |   TABLE ACCESS STORAGE FULL | S_CUSTOMER_INFO    | 27000 |

|*  3 |   HASH JOIN OUTER           |                    | 20000 |

|*  4 |    TABLE ACCESS STORAGE FULL| PIT_CUSTOMER       | 20000 |

|   5 |    TABLE ACCESS STORAGE FULL| S_CUSTOMER_ADDRESS | 33406 |

——————————————————————

 

If you run this query on Oracle 12.1 or below, you will get the same execution plan as if you would select columns from all Satellites.

Conclusion

Join elimination is available since a long time, but since Oracle 12.2 is works for more than one join column. For the suggested approach of a view layer on top of a Data Vault model, this can be very helpful to improve the performance of queries on a Data Vault schema. This reduces the time to load a Star Schema from a Data Vault, or even allows to implement virtual data marts (i.e. dimensions and facts are implemented only as views).

Although the extended support of Oracle 12.1 was extended for another year (see blog post of Mike Dietrich), there are many reasons to switch to a newer version of the Oracle database. If you work with Data Vault, the enhancement of join elimination is an important one.

Leave a comment