The Advantage of Silly Mistakes

Silly mistakes can help to find bugs! Because of a strange behavior on a test database we found a much severe bug in the ETL jobs of my customer’s Data Warehouse. Consulting life is full of surprises.

In the test environment of a Data Warehouse at one of my customers, we recently had a strange mystery. A view returned a set of rows, but the ETL job using this view did not insert any data into the target table. Until a few days ago, I thought that the result set of a view can easily be inserted into a table. But under some circumstances this seems not to be the case:

SELECT * FROM a_view_in_a_customer_project;
12345 rows selected.
INSERT INTO target_table
SELECT * FROM a_view_in_a_customer_project;
0 rows created.

What happened here?! The reason for this surprising behavior is – once more – Join Elimination (see my previous blog post Join Elimination: Difference in Oracle 11g and 12c). In some situations, this query transformation works fine for a SELECT statement, but not for the corresponding INSERT statement. Let me explain this with the following simplified use case.

Code Lookup Views with Join Elimination

The tables in the Core layer of a Data Warehouse contain many code attributes. All these codes are stored as surrogate keys referring a central code table. This code table – let’s call it table C – consists of all codes and descriptions for different code domains (e.g. currencies, countries, status codes). All other Core tables – let’s call them T1 and T2 – have foreign keys to the central code table. In our simple example, table T1 has five code attributes, T2 has two of them.

T1 T2 C

For all tables in the Core DWH, a code lookup view is generated (see green views in picture above). Each view contains a key lookup for each code attribute of the base table and returns the code and description. These columns can be used to load the dimension tables in the Data Mart. If table T1 contains five codes, the code lookup view looks like this:

CREATE OR REPLACE VIEW v1 AS
SELECT /*+ no_merge */
t1.id
, c1.code c11_cd
, c1.description c11_desc
, c2.code c12_cd
, c2.description c12_desc
, c3.code c13_cd
, c3.description c13_desc
, c4.code c14_cd
, c4.description c14_desc
, c5.code c15_cd
, c5.description c15_desc
FROM t1
JOIN c c1 ON (c1.id = t1.c11)
JOIN c c2 ON (c2.id = t1.c12)
JOIN c c3 ON (c3.id = t1.c13)
JOIN c c4 ON (c4.id = t1.c14)
JOIN c c5 ON (c5.id = t1.c15);

If no row is found in table C for a particular code and domain during the ETL process, a reference to a singleton row (an additional code for unknown values) is stored. With this approach it is possible to use inner joins in all code lookups. The no_merge hint in the view is used to avoid cartesian joins of all the code lists. With the hint, the code views are non-mergeable and are joined like regular tables, as we see in the following simple query:

EXPLAIN PLAN FOR
SELECT v1.c11_desc
     , v1.c14_desc
     , v2.c22_desc
  FROM v1 JOIN v2 ON (v1.id = v2.t1_id);

SELECT * FROM table(dbms_xplan.display);


Plan hash value: 853892726
 
------------------------------------------------------------------------------
| Id  | Operation             | Name | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |      |   100K|    17M|   129   (3)| 00:00:02 |
|*  1 |  HASH JOIN            |      |   100K|    17M|   129   (3)| 00:00:02 |
|   2 |   VIEW                | V1   |   100 | 11700 |    18   (6)| 00:00:01 |
|*  3 |    HASH JOIN          |      |   100 | 22800 |    18   (6)| 00:00:01 |
|*  4 |     HASH JOIN         |      |   100 | 12300 |    11  (10)| 00:00:01 |
|   5 |      TABLE ACCESS FULL| T1   |   100 |  1800 |     3   (0)| 00:00:01 |
|   6 |      TABLE ACCESS FULL| C    |  1000 |   102K|     7   (0)| 00:00:01 |
|   7 |     TABLE ACCESS FULL | C    |  1000 |   102K|     7   (0)| 00:00:01 |
|   8 |   VIEW                | V2   |   100K|  6347K|   110   (1)| 00:00:02 |
|*  9 |    HASH JOIN          |      |   100K|    10M|   110   (1)| 00:00:02 |
|  10 |     TABLE ACCESS FULL | C    |  1000 |   102K|     7   (0)| 00:00:01 |
|  11 |     TABLE ACCESS FULL | T2   |   100K|   878K|   103   (1)| 00:00:02 |
------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - access("V1"."ID"="V2"."T1_ID")
   3 - access("C4"."ID"="T1"."C14")
   4 - access("C1"."ID"="T1"."C11")
   9 - access("C2"."ID"="T2"."C22")

The execution plan shows that the optimizer does a hash join of the views V1 and V2. Within the views, the tables T1 and T2 are hash joined with the code table C – twice in V1 and once in V2.

But wait! T1 consists of five different code attributes. Why is the code table joined only twice? Because only two of the codes are used in the query. To lookup the code descriptions for codes C11 and C14, the code table has to be joined two times. The additional joins for C12, C13 and C15 are not used here and therefore eliminated in the execution plan. This is how Join Elimination works.

Loading a Data Mart Table

Now, we want to use this query to load a dimension table – let’s call it T3 – in a Data Mart. We use a Direct-Path INSERT for our test query and load the result into the target table. What does the optimizer with this statement?

EXPLAIN PLAN FOR
INSERT /*+ append */ INTO t3
SELECT v1.c11_desc
     , v1.c14_desc
     , v2.c22_desc
  FROM v1 JOIN v2 ON (v1.id = v2.t1_id);

SELECT * FROM table(dbms_xplan.display);


Plan hash value: 2743973873
 
----------------------------------------------------------------------------------
| Id  | Operation                 | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | INSERT STATEMENT          |      |   100K|    17M|   132   (4)| 00:00:02 |
|   1 |  LOAD AS SELECT           | T3   |       |       |            |          |
|*  2 |   HASH JOIN               |      |   100K|    17M|   132   (4)| 00:00:02 |
|   3 |    VIEW                   | V1   |   100 | 11700 |    18   (6)| 00:00:01 |
|*  4 |     HASH JOIN             |      |   100 | 24000 |    18   (6)| 00:00:01 |
|   5 |      NESTED LOOPS         |      |   100 | 13500 |    11  (10)| 00:00:01 |
|   6 |       NESTED LOOPS        |      |   100 | 13100 |    11  (10)| 00:00:01 |
|*  7 |        HASH JOIN          |      |   100 | 12700 |    11  (10)| 00:00:01 |
|   8 |         NESTED LOOPS      |      |   100 |  2200 |     3   (0)| 00:00:01 |
|   9 |          TABLE ACCESS FULL| T1   |   100 |  1800 |     3   (0)| 00:00:01 |
|* 10 |          INDEX UNIQUE SCAN| C_PK |     1 |     4 |     0   (0)| 00:00:01 |
|  11 |         TABLE ACCESS FULL | C    |  1000 |   102K|     7   (0)| 00:00:01 |
|* 12 |        INDEX UNIQUE SCAN  | C_PK |     1 |     4 |     0   (0)| 00:00:01 |
|* 13 |       INDEX UNIQUE SCAN   | C_PK |     1 |     4 |     0   (0)| 00:00:01 |
|  14 |      TABLE ACCESS FULL    | C    |  1000 |   102K|     7   (0)| 00:00:01 |
|  15 |    VIEW                   | V2   |   100K|  6347K|   113   (2)| 00:00:02 |
|* 16 |     HASH JOIN             |      |   100K|    11M|   113   (2)| 00:00:02 |
|  17 |      TABLE ACCESS FULL    | C    |  1000 |   102K|     7   (0)| 00:00:01 |
|* 18 |      HASH JOIN            |      |   100K|  1269K|   105   (1)| 00:00:02 |
|  19 |       INDEX FAST FULL SCAN| C_PK |  1000 |  4000 |     2   (0)| 00:00:01 |
|  20 |       TABLE ACCESS FULL   | T2   |   100K|   878K|   103   (1)| 00:00:02 |
----------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("V1"."ID"="V2"."T1_ID")
   4 - access("C1"."ID"="T1"."C11")
   7 - access("C4"."ID"="T1"."C14")
  10 - access("C5"."ID"="T1"."C15")
  12 - access("C3"."ID"="T1"."C13")
  13 - access("C2"."ID"="T1"."C12")
  16 - access("C2"."ID"="T2"."C22")
  18 - access("C1"."ID"="T2"."C21")

In the execution plan, we can see a bad surprise: The Join Elimination from the previous example does not work here anymore! Seven lookups on table C (or on the primary key index C_PK) are performed, but only three of them are actually needed. For this simple case, this may be sufficient, but in the customer project they have tables with up to 30 different code attributes.

For me, it is not clear why an INSERT statement should use a different execution plan than the corresponding SELECT statement. I did not found any feasible explanation for this strange behavior, but at least a workaround: If the SELECT statement is packed into an additional Non-mergeable View, the correct execution plan is used:

EXPLAIN PLAN FOR
INSERT /*+ append */ INTO t3
SELECT * FROM (
SELECT /*+ no_merge */
       v1.c11_desc
     , v1.c14_desc
     , v2.c22_desc
  FROM v1 JOIN v2 ON (v1.id = v2.t1_id));

SELECT * FROM table(dbms_xplan.display);

Plan hash value: 3853099109
 
--------------------------------------------------------------------------------
| Id  | Operation               | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | INSERT STATEMENT        |      |   100K|    14M|   129   (3)| 00:00:02 |
|   1 |  LOAD AS SELECT         | T3   |       |       |            |          |
|   2 |   VIEW                  |      |   100K|    14M|   129   (3)| 00:00:02 |
|*  3 |    HASH JOIN            |      |   100K|    17M|   129   (3)| 00:00:02 |
|   4 |     VIEW                | V1   |   100 | 11700 |    18   (6)| 00:00:01 |
|*  5 |      HASH JOIN          |      |   100 | 22800 |    18   (6)| 00:00:01 |
|*  6 |       HASH JOIN         |      |   100 | 12300 |    11  (10)| 00:00:01 |
|   7 |        TABLE ACCESS FULL| T1   |   100 |  1800 |     3   (0)| 00:00:01 |
|   8 |        TABLE ACCESS FULL| C    |  1000 |   102K|     7   (0)| 00:00:01 |
|   9 |       TABLE ACCESS FULL | C    |  1000 |   102K|     7   (0)| 00:00:01 |
|  10 |     VIEW                | V2   |   100K|  6347K|   110   (1)| 00:00:02 |
|* 11 |      HASH JOIN          |      |   100K|    10M|   110   (1)| 00:00:02 |
|  12 |       TABLE ACCESS FULL | C    |  1000 |   102K|     7   (0)| 00:00:01 |
|  13 |       TABLE ACCESS FULL | T2   |   100K|   878K|   103   (1)| 00:00:02 |
--------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("V1"."ID"="V2"."T1_ID")
   5 - access("C4"."ID"="T1"."C14")
   6 - access("C1"."ID"="T1"."C11")
  11 - access("C2"."ID"="T2"."C22")

Solving the Mystery

What has all this to do with the strange mystery at the beginning and the title of this blog post?

After some research we discovered that all codes of one domain were missing. They were deleted from table C and reloaded with new surrogate keys. Of course, this should never happen in a production environment, but on a test database everything is possible. Because all foreign key constraints in this DWH database are defined with RELY NOVALIDATE, Oracle does not check the data consistency.

For this inconsistent data set, the correct result set of the view should be empty. But with Join Elimination, all rows were returned because the missing codes were not in the select list. Therefore the result is different, depending on whether Join Elimination is used or not. This is not an Oracle bug, but the planned behavior. Join Elimination can only be used when the foreign keys are either validated or defined as RELY. And “rely” means that Oracle relies on consistent data. It is our fault when we load inconsistent data into the Data Warehouse. After reloading a correct set of data into our test database, the problem was solved.

Was it really? This problem with the different results was solved, but not a much severe one. Due to this silly mistake of wrong test data, I realized that the usage of Join Elimination can be different during a SELECT and an INSERT statement. This may cause very bad performance for some of the ETL jobs. Now I know that I have to investigate the execution plans even more carefully to be sure that only the required code tables are joined.

Sometimes it’s good to make mistakes. They help to find other issues.

Advertisements

One thought on “The Advantage of Silly Mistakes

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s