The data type of primary key and foreign key columns has an impact on query performance and should therefore be considered when designing the keys of a Data Vault schema.
In a performance review of a customer project, we detected many read and write operations on the TEMP tablespace. These I/O operations were caused by hash joins of large tables in a Data Vault schema. The customer had already increased the size of the Program Global Area (PGA) with the initialization parameters pga_aggregate_target and pga_aggregate_limit to get more memory for the hash tables. But for queries on many tables, the TEMP tablespace was still used extensively. How can this I/O be reduced?
A common approach in Data Vault is the use of hash keys for primary and foreign keys. My customer uses MD5 hash values: the 128 bits are stored as a hexadecimal string in columns of data type CHAR(32). My recommendation is to use the data type RAW(16) instead. This not only halves the disk space needed to store the keys, but also improves the performance of large joins. The following test case shows this effect.
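As a minimal sketch of the difference (the hub table and its columns here are hypothetical, not from the customer project): in Oracle 12c and later, STANDARD_HASH returns the MD5 digest directly as RAW, so no hex conversion is needed at load time, and existing CHAR(32) keys can be converted with HEXTORAW.

```sql
-- Hypothetical hub with the key stored as RAW(16) instead of CHAR(32)
CREATE TABLE h_example (
   h_example_key  RAW(16)       NOT NULL PRIMARY KEY   -- 16 bytes instead of 32
 , business_key   VARCHAR2(100) NOT NULL
 , load_date      DATE          NOT NULL
);

-- STANDARD_HASH (Oracle 12c+) returns the MD5 digest as RAW, ready to store
INSERT INTO h_example (h_example_key, business_key, load_date)
VALUES (STANDARD_HASH('BK-4711', 'MD5'), 'BK-4711', SYSDATE);

-- Existing CHAR(32) hex keys can be migrated with HEXTORAW
-- (and displayed again with RAWTOHEX)
SELECT HEXTORAW('5D41402ABC4B2A76B9719D911017C592') AS raw_key FROM dual;
```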
Test Example
For the test example, I use my Microbrewery Data Vault demo schema on an Autonomous Data Warehouse (ADWC) database. Originally, I used this schema to test the ETL performance of ADWC. The data is only random data, generated at the beginning of each “daily load” (the amount of sales data is far above what I sell in my real microbrewery).
The following query joins 7 tables (3 hubs, 2 links, 2 satellites) of my demo schema. The biggest tables are H_ORDER_ITEM, L_ORDER_ITEM, L_BEER_ORDER_ITEM and S_ORDER_ITEM with 68 million rows each. The hub H_ORDER_ITEM will not be joined because of join elimination. For the satellites S_ORDER_ITEM and S_BEER_DESCRIPTION, a view is used instead of the table. These “current views” return only the current version for each hub key, determined with the analytic function MAX(load_date) OVER (PARTITION BY h_order_item_key). This requires additional TEMP space for sort operations, as we will see later.
SELECT hbee.beer_name
     , sbde.style
     , SUM(sori.quantity) num_of_bottles
     , SUM(sori.price_total) total_amount
     , COUNT(*) num_lines
  FROM h_order hord
  JOIN l_order_item lori
    ON (hord.h_order_key = lori.h_order_key)
  JOIN h_order_item hori
    ON (lori.h_order_item_key = hori.h_order_item_key)
  JOIN v_curr_s_order_item sori
    ON (hori.h_order_item_key = sori.h_order_item_key)
  JOIN l_beer_order_item lboi
    ON (hori.h_order_item_key = lboi.h_order_item_key)
  JOIN h_beer hbee
    ON (lboi.h_beer_key = hbee.h_beer_key)
  JOIN v_curr_s_beer_description sbde
    ON (hbee.h_beer_key = sbde.h_beer_key)
 GROUP BY hbee.beer_name
        , sbde.style
 ORDER BY total_amount DESC
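The current satellite views used in this query are not listed here; a sketch of how such a view can be defined looks as follows. The column list is an assumption based on the columns referenced in the query above; the actual view definition may differ.

```sql
-- Sketch of a "current view" on a satellite: for each hub key, keep only
-- the version with the latest load_date, using the analytic function
-- mentioned above. Column names are assumptions based on the test query.
CREATE OR REPLACE VIEW v_curr_s_order_item AS
SELECT h_order_item_key
     , quantity
     , price_total
     , load_date
  FROM (SELECT sori.*
             , MAX(load_date) OVER (PARTITION BY h_order_item_key) AS max_load_date
          FROM s_order_item sori)
 WHERE load_date = max_load_date;
```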
For the test, I created three identical schemas with the same amount of data. The only difference is the data type of the primary key and foreign key columns:
- Schema 1 contains MD5 hash keys, stored in CHAR(32) columns
- Schema 2 contains MD5 hash keys, stored in RAW(16) columns
- Schema 3 contains numeric surrogate keys, stored in NUMBER columns
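As an illustration, a link table would be declared as follows in the first schema (the column set is a simplified assumption, not the actual DDL of the demo schema):

```sql
-- Schema 1: MD5 hash keys stored as hexadecimal strings (32 bytes per key)
CREATE TABLE l_order_item (
   l_order_item_key  CHAR(32) NOT NULL PRIMARY KEY
 , h_order_key       CHAR(32) NOT NULL REFERENCES h_order
 , h_order_item_key  CHAR(32) NOT NULL REFERENCES h_order_item
 , load_date         DATE     NOT NULL
);

-- Schema 2: identical, but all key columns are RAW(16)  (16 bytes per key)
-- Schema 3: identical, but all key columns are NUMBER, filled from sequences
```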
Analysis of Execution Plan
The query above was executed in each schema. The execution plans of all three executions are almost identical. The query is executed in parallel, and all tables are joined with hash joins. The hubs H_ORDER and H_ORDER_ITEM are not displayed in the execution plan: because no columns of these tables are required for our test query, the optimizer applies join elimination. The WINDOW SORT operations in lines 29 and 53 are required for the analytic functions in the current satellite views. The RESULT CACHE operation in line 1 is shown because ADWC activates the result cache by default. The elapsed times below are therefore the results of the first execution in each schema. The second execution is much faster (< 1 second) because the result cache can be used.
The most interesting information for this test is the column TempSpc in the execution plan (depending on your browser, you may have to scroll to the right to see it). This column displays the amount of data that is written to the TEMP tablespace (here for schema 1 with CHAR(32) columns). TEMP is used for the following steps:
- Join between L_ORDER_ITEM and L_BEER_ORDER_ITEM (line 36)
- Sort for analytical function in view V_CURR_S_ORDER_ITEM (line 29)
- Join of previous join result with S_ORDER_ITEM (line 24)
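A plan like the one below, including the TempSpc column, can be produced with EXPLAIN PLAN and DBMS_XPLAN (TempSpc is part of the default "TYPICAL" format). Here is a sketch with a simplified join for brevity; the full test query works the same way:

```sql
-- Explain a simplified variant of the test query ...
EXPLAIN PLAN FOR
SELECT hbee.beer_name, COUNT(*)
  FROM h_beer hbee
  JOIN l_beer_order_item lboi
    ON (lboi.h_beer_key = hbee.h_beer_key)
 GROUP BY hbee.beer_name;

-- ... and format the plan; TempSpc is shown for operations that are
-- expected to spill to the TEMP tablespace
SELECT * FROM TABLE(dbms_xplan.display);
```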
----------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows |TempSpc|
----------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | |
| 1 | RESULT CACHE | 86k6map37mfj4fz7xs31fcubqn | | |
| 2 | PX COORDINATOR | | | |
| 3 | PX SEND QC (ORDER) | :TQ10013 | 20 | |
| 4 | SORT ORDER BY | | 20 | |
| 5 | PX RECEIVE | | 20 | |
| 6 | PX SEND RANGE | :TQ10012 | 20 | |
| 7 | HASH GROUP BY | | 20 | |
| 8 | PX RECEIVE | | 20 | |
| 9 | PX SEND HASH | :TQ10011 | 20 | |
| 10 | HASH GROUP BY | | 20 | |
| 11 | HASH JOIN | | 20 | |
| 12 | PX RECEIVE | | 20 | |
| 13 | PX SEND HYBRID HASH | :TQ10009 | 20 | |
| 14 | STATISTICS COLLECTOR | | | |
| 15 | HASH JOIN BUFFERED | | 20 | |
| 16 | PX RECEIVE | | 20 | |
| 17 | PX SEND HYBRID HASH | :TQ10006 | 20 | |
| 18 | STATISTICS COLLECTOR | | | |
| 19 | VIEW | VW_GBC_25 | 20 | |
| 20 | HASH GROUP BY | | 20 | |
| 21 | PX RECEIVE | | 20 | |
| 22 | PX SEND HASH | :TQ10005 | 20 | |
| 23 | HASH GROUP BY | | 20 | |
| 24 | HASH JOIN | | 68M| 1434M|
| 25 | PX RECEIVE | | 68M| |
| 26 | PX SEND HYBRID HASH | :TQ10003 | 68M| |
| 27 | STATISTICS COLLECTOR | | | |
| 28 | VIEW | | 68M| |
| 29 | WINDOW SORT | | 68M| 3928M|
| 30 | PX RECEIVE | | 68M| |
| 31 | PX SEND HASH | :TQ10000 | 68M| |
| 32 | PX BLOCK ITERATOR | | 68M| |
| 33 | TABLE ACCESS STORAGE FULL| S_ORDER_ITEM | 68M| |
| 34 | PX RECEIVE | | 68M| |
| 35 | PX SEND HYBRID HASH | :TQ10004 | 68M| |
| 36 | HASH JOIN BUFFERED | | 68M| 733M|
| 37 | PX RECEIVE | | 68M| |
| 38 | PX SEND HYBRID HASH | :TQ10001 | 68M| |
| 39 | STATISTICS COLLECTOR | | | |
| 40 | PX BLOCK ITERATOR | | 68M| |
| 41 | TABLE ACCESS STORAGE FULL | L_ORDER_ITEM | 68M| |
| 42 | PX RECEIVE | | 68M| |
| 43 | PX SEND HYBRID HASH | :TQ10002 | 68M| |
| 44 | PX BLOCK ITERATOR | | 68M| |
| 45 | TABLE ACCESS STORAGE FULL | L_BEER_ORDER_ITEM | 68M| |
| 46 | PX RECEIVE | | 20 | |
| 47 | PX SEND HYBRID HASH | :TQ10007 | 20 | |
| 48 | PX BLOCK ITERATOR | | 20 | |
| 49 | TABLE ACCESS STORAGE FULL | H_BEER | 20 | |
| 50 | PX RECEIVE | | 20 | |
| 51 | PX SEND HYBRID HASH | :TQ10010 | 20 | |
| 52 | VIEW | | 20 | |
| 53 | WINDOW SORT | | 20 | |
| 54 | PX RECEIVE | | 20 | |
| 55 | PX SEND HASH | :TQ10008 | 20 | |
| 56 | PX BLOCK ITERATOR | | 20 | |
| 57 | TABLE ACCESS STORAGE FULL | S_BEER_DESCRIPTION | 20 | |
----------------------------------------------------------------------------------------------------------------
Comparison of Results
The TEMP space requirements depend on the size of the join columns: a CHAR(32) key occupies 32 bytes, a RAW(16) key 16 bytes, and a NUMBER surrogate key typically even less. Therefore, the data type of the primary key and foreign key columns determines how much space is required in the PGA memory or in the TEMP tablespace. As you can see in the overview of the results, the total TempSpc (sum of the TempSpc column of the execution plan) is higher for CHAR(32) columns than for RAW(16) columns, and much lower for data type NUMBER.
Schema | PK / FK Columns | Datatype | TempSpc | Elapsed |
DANI_DV_1 | MD5 Hash | CHAR(32) | 6095M | 09:48 |
DANI_DV_2 | MD5 Hash | RAW(16) | 4190M | 07:33 |
DANI_DV_3 | Sequence | NUMBER | 2364M | 03:07 |
This has a direct impact on the performance of the query (always the first execution). As you can see in column Elapsed, the query in the schema with the CHAR(32) data type was slower than in the schema with RAW(16). Joins on numeric surrogate keys are faster than both implementations of MD5 hash keys. The following charts show the TEMP space usage and elapsed time of the test query for each data type:
Recommendations
Many people use hash keys in Data Vault because they think this is mandatory. There are good reasons for hash keys (e.g. independent loads of hubs, links and satellites, unique keys in heterogeneous environments, same key values in different database environments). But from a performance perspective, numeric keys are the better choice. If all your Data Vault tables are stored in one database and good query performance is an important requirement, numeric surrogate keys are a good option. If you decide to use hash keys, store them in RAW columns, not in CHAR or VARCHAR2 columns.
Great recommendation to use RAW(16) for MD5 in Oracle. I did my first DV 2.0 with MD5 about 8 years ago in what became a 25TB Oracle EDW. This would have been helpful. We did the same type of testing on surrogate keys vs MD5. Yes integers were faster, but we needed to load in parallel so we went with MD5. The question really is faster loads vs fast query right?