Delta Detection in Oracle SQL

Delta detection is a common task in every Data Warehouse. It compares new data from a source system with the last versions in the Data Warehouse to find out whether a new version has to be created. There are several ways to implement this in Oracle.

Your source system delivers a full extraction every night, and you have to load only the changed rows into your Core Data Warehouse? You receive incremental loads from another source system every few minutes, but only a few columns are loaded into the Data Warehouse? You want to load Satellites into a Data Vault layer of your Data Warehouse? You need to populate dimension tables of Slowly Type Dimension type 2? Or you want to avoid updates of all rows in a large SCD1 dimension table?

In all these situations, you need a delta detection mechanism to identify the rows that have to be inserted or updated in your Data Warehouse tables. In this blog post, I want to show different methods in Oracle SQL that provide the subset of rows of a source table that were changed since the last load. All these methods are set-based, i.e. they can be executed in one SQL statement.

Basic Idea: Full (or Left) Outer Join

All delta detection methods described below use the same basic principle. With a full outer join between source table (current state of the data) and target table (last loaded version) you can identify all rows that were either inserted or deleted in the source table since the last load, and all existing rows that differ in at least one column. The join is done on the business key or the primary key columns.

  • If a row exists only in the source, but not in the target table, it is a new record inserted in the source system. In the target table, a new row has to be inserted.
  • If a row exists only in the target, but not in the source table, the record was deleted in the source system. This case is only possible if the source system delivers a full extraction of all data with every load job. Typically, we must not delete the rows from the target table in the Data Warehouse, but we can mark them as deleted (e.g. with a delete flag or a validity end date).
  • If a row exists in both tables, and if one or more columns are different in source and target, the record was updated in the source system. Depending on the situation, either a new version is inserted in the target table (e.g. a versioned Core table, a Satellite in a Data Vault model, or an SCD2 dimension in a Star Schema), or the exiting row in the target table is updated (e.g. SCD1 dimension in a Star Schema). Optionally, the end date of the previous version is updated.

In many situations, the deleted rows are not relevant or cannot be detected. This is the case if only partial extractions of the source systems are delivered. In this case, the full outer join can be replaced by a left outer join. For the following examples, I will use the full outer join, anyway – just to show how it works.

To explain the different methods, I just use SELECT statements. In a real project, the SELECT could be part of a Direct-load INSERT or MERGE statement, or a view that is used as source of an ETL process. The SELECT statements are practical to demonstrate the different methods. If you want to test them, you can prepare two identical copies of the SCOTT.EMP table:

CREATE TABLE emp_source AS SELECT * FROM scott.emp;
CREATE TABLE emp_target AS SELECT * FROM scott.emp; 
Now, execute some INSERT, UPDATE and DELETE statements on the source table. The following SQL statement should then display all new, changed and deleted rows and the type of DML operation that was performed for each row:
SELECT CASE
WHEN t.empno IS NULL THEN 'INS'
WHEN s.empno IS NULL THEN 'DEL'
ELSE 'UPD'
END dml_flag
, NVL(s.empno, t.empno) empno
, s.ename, s.job, s.mgr, s.hiredate, s.sal, s.comm, s.deptno
  FROM emp_source s
  FULL JOIN emp_target t ON (s.empno = t.empno)
 WHERE (t.empno IS NULL) — inserted row
OR (s.empno IS NULL) — deleted row
    OR (s.ename != t.ename)
    OR (s.job != t.job)
    OR (s.mgr != t.mgr)
    OR (s.hiredate != t.hiredate)
    OR (s.sal != t.sal)
    OR (s.comm != t.comm)
    OR (s.deptno != t.deptno);

What’s wrong with this statement? As long as all columns are NOT NULL, it works fine. But NULL values are not handled. If for example a commission changes from NULL to 100, the condition (s.comm != t.comm)  is evaluated to (NULL != 100). Equality or inequality tests with NULL values are never true, so a change from NULL to any value or vice versa is not tracked. If columns may contain NULL values, we have to improve our statement.

Method 1: NVL

One method to handle NULL values in delta comparisons is the usage of the NVL function. When a column is NULL, a default value is returned instead. This is done for each column in the WHERE condition – or at least for those that are nullable. In our example, let’s assume that all columns may be NULL, In this case, the statement looks like this:

SELECT CASE
WHEN t.empno IS NULL THEN 'INS'
WHEN s.empno IS NULL THEN 'DEL'
ELSE 'UPD'
END dml_flag
, NVL(s.empno, t.empno) empno
, s.ename, s.job, s.mgr, s.hiredate, s.sal, s.comm, s.deptno
  FROM emp_source s
  FULL JOIN emp_target t ON (s.empno = t.empno)
 WHERE (NVL(s.ename, '(null)') != NVL(t.ename, '(null)'))
    OR (NVL(s.job, '(null)') != NVL(t.job, '(null)'))
    OR (NVL(s.mgr, -999999) != NVL(t.mgr, -999999))
    OR (NVL(s.hiredate, TO_DATE('00010101', 'YYYYMMDD')) != NVL(t.hiredate, TO_DATE('00010101', 'YYYYMMDD')))
    OR (NVL(s.sal, -999999) != NVL(t.sal, -999999))
    OR (NVL(s.comm, -999999) != NVL(t.comm, -999999))
    OR (NVL(s.deptno, -999999) != NVL(t.deptno, -999999));

The expressions to identify new and deleted rows are not needed anymore. The (non-existing) row in the compared table contains NULL values for all columns. With the correct handling of NULL values, the new and deleted rows are returned by the query as well.

The default values in the NVL statement must have the same data type as the column. So, we need different constants, depending on the data types. The default values should be constants that do not exist in the table, otherwise there is a risk of loosing some changes. If the missing hire date of an employee is set to 1st January 0001 (which is rather unlikely in most companies), this change would not be tracked. In most cases, this is only a theoretical issue.

The disadvantage of this method is that it is hard to implement because of the different default values. Developers must very carefully define the correct constant for each column. If the statement is generated with a DWH automation tool, this is simple, but manual development is tricky and error-prone with this method.

Method 2: LNNVL

To avoid separate default constants for each data type, the column comparison can be implemented with the LNNVL function of Oracle. The function takes a condition as input parameter and returns TRUE if the condition is either FALSE or UNKNOWN. If the condition is TRUE, the function LNNVL returns FALSE. Confusing? The example with LNNVL(attr1 = attr2) shows how it works:

attr1 attr2   LNNVL(attr1 = attr2)
——————————————-------------———-------------------
1 1 FALSE
1 2 TRUE
1 NULL TRUE
NULL 2 TRUE
NULL NULL TRUE 

For the requirements of delta detection, is is almost what we need: LNNVL returns TRUE if the two columns are different of if one of them is NULL. Unfortunately, it is also TRUE, if both colums are NULL (last line in table above). To exclude this case, an additional condition NVL(attr1, attr2) IS NOT NULL must be included for each column pair. The resulting query is a bit hard to understand, but it works for all data types:

SELECT CASE
WHEN t.empno IS NULL THEN 'INS'
WHEN s.empno IS NULL THEN 'DEL'
ELSE 'UPD'
END dml_flag
, NVL(s.empno, t.empno) empno
, s.ename, s.job, s.mgr, s.hiredate, s.sal, s.comm, s.deptno
  FROM emp_source s
  FULL JOIN emp_target t ON (s.empno = t.empno)
 WHERE (LNNVL(s.ename = t.ename) AND NVL(s.ename, t.ename) IS NOT NULL)
    OR (LNNVL(s.job = t.job) AND NVL(s.job, t.job) IS NOT NULL)
    OR (LNNVL(s.mgr = t.mgr) AND NVL(s.mgr, t.mgr) IS NOT NULL)
    OR (LNNVL(s.hiredate = t.hiredate) AND NVL(s.hiredate, t.hiredate) IS NOT NULL)
    OR (LNNVL(s.sal = t.sal) AND NVL(s.sal, t.sal) IS NOT NULL)
    OR (LNNVL(s.comm = t.comm) AND NVL(s.comm, t.comm) IS NOT NULL)
    OR (LNNVL(s.deptno = t.deptno) AND NVL(s.deptno, t.deptno) IS NOT NULL);

Method 3: SYS_OP_MAP_NONNULL

The (undocumented) function SYS_OP_MAP_NONNULL replaces NULL values by an internal placeholder. This allows to compare all values with an equal sign, even when they are NULL. The expression SYS_OP_MAP_NONNULL(NULL) = SYS_OP_MAP_NONNULL(NULL) is always TRUE. With this function, the comparison of all columns is easier than with the previous methods. Because the function is officially still undocumented, you have to decide yourself whether you want to use it in your ETL processes or not.

SELECT CASE
WHEN t.empno IS NULL THEN 'INS'
WHEN s.empno IS NULL THEN 'DEL'
ELSE 'UPD'
END dml_flag
, NVL(s.empno, t.empno) empno
, s.ename, s.job, s.mgr, s.hiredate, s.sal, s.comm, s.deptno
  FROM emp_source s
  FULL JOIN emp_target t ON (s.empno = t.empno)
WHERE (SYS_OP_MAP_NONNULL(s.ename) != SYS_OP_MAP_NONNULL(t.ename))
OR (SYS_OP_MAP_NONNULL(s.job) != SYS_OP_MAP_NONNULL(t.job))
OR (SYS_OP_MAP_NONNULL(s.mgr) != SYS_OP_MAP_NONNULL(t.mgr))
OR (SYS_OP_MAP_NONNULL(s.hiredate) != SYS_OP_MAP_NONNULL(t.hiredate))
OR (SYS_OP_MAP_NONNULL(s.sal) != SYS_OP_MAP_NONNULL(t.sal))
OR (SYS_OP_MAP_NONNULL(s.comm) != SYS_OP_MAP_NONNULL(t.comm))
OR (SYS_OP_MAP_NONNULL(s.deptno) != SYS_OP_MAP_NONNULL(t.deptno));

Method 4: DECODE

In a customer project, I have seen the following interesting approach: The DECODE function compares values, even if they are NULL. The expression DECODE(attr1, attr2, 0, 1) returns 0 if the two attributes are equal, or If both are NULL. Otherwise, the result is 1. To find out whether at least one column is different, the DECODE expression is calculated on all columns, and the sum of all these flags is built. The result returns the number of different columns and can be used in the WHERE condition as follows:

SELECT CASE
WHEN t.empno IS NULL THEN 'INS'
WHEN s.empno IS NULL THEN 'DEL'
ELSE 'UPD'
END dml_flag
, NVL(s.empno, t.empno) empno
, s.ename, s.job, s.mgr, s.hiredate, s.sal, s.comm, s.deptno
  FROM emp_source s
  FULL JOIN emp_target t ON (s.empno = t.empno)
 WHERE DECODE(s.ename, t.ename, 0, 1)
     + DECODE(s.job, t.job, 0, 1)
     + DECODE(s.mgr, t.mgr, 0, 1)
     + DECODE(s.hiredate, t.hiredate, 0, 1)
     + DECODE(s.sal, t.sal, 0, 1)
     + DECODE(s.comm, t.comm, 0, 1)
     + DECODE(s.deptno, t.deptno, 0, 1) > 0;

A similar, but slightly different implementation of this approach can be found in Stew Ashtons’s blog post MERGE magic and madness.

Method 5: ORA_HASH

Instead of comparing each pair of columns, it is possible to concatenate all columns and build a hash key on the result string. After that, only the hash values have to be compared. This method is often used in Data Vault 2.0. In Oracle, we can use the ORA_HASH function. It can be executed on any input data type (except LONG and LOB) and returns a number between 0 and 4294967295 (2 ^ 32 – 1). The following SQL statement is an example of how it can be used for delta detection:

SELECT CASE
WHEN t.empno IS NULL THEN 'INS'
WHEN s.empno IS NULL THEN 'DEL'
ELSE 'UPD'
END dml_flag
, NVL(s.empno, t.empno) empno
, s.ename, s.job, s.mgr, s.hiredate, s.sal, s.comm, s.deptno
  FROM emp_source s
  FULL JOIN emp_target t ON (s.empno = t.empno)
 WHERE ORA_HASH(s.ename ||'|'||
                s.job ||'|'||
                s.mgr ||'|'||
                s.hiredate||'|'||
                s.sal ||'|'||
                s.comm ||'|'||
                s.deptno)
    != ORA_HASH(t.ename ||'|'||
                t.job ||'|'||
                t.mgr ||'|'||
                t.hiredate||'|'||
                t.sal ||'|'||
                t.comm ||'|'||
                t.deptno);

There are some details to be considered when using hash values:

  • Theoretically, there is the risk of hash collisions (i.e. two different strings will have the same hash key). The probability of a hash collision with ORA_HASH is very small.
  • A delimited should be used, that does not appear in the contents of the columns. Otherwise, the concatenation of different values may return the same result. For example, the concatenation of ‘abc’ and ‘de’ would generate the same hash value as ‘ab’ and ‘cde’. With a delimiter ‘|’, the strings ‘abc|de’ and ‘ab|cde’ are different.
  • NULL values should be replaced by a placeholder or an empty string. Because NULL and an empty string are equal in Oracle (but not in ANSI SQL), this transformation is not required here.
  • For all columns of data types NUMBER, DATE or TIMESTAMP, an implicit type conversion to VARCHAR2 is performed. This may cause problems with dates, if different NLS settings are used. For the example above, column hirre_date should be converted explicitly with the TO_CHAR function.
  • For tables with many columns or very long VARCHAR2 columns, the concatenated result string may exceed the maximum length of 4000 characters (or 32767 characters in Oracle 12c with max_string_size = extended).

Comparison

All of the delta detection methods described above will return the same results – except the very first example that does not consider NULL values. It’s more a question of style which method you prefer. For manual development, method 1 is not recommended because the default values are different depending on the data types. If the SQL statements are generated (which I recommend anyway), all methods are similar. Of course, within one project, it is strongly recommended to use only one method and not to mix them.

The performance is the same for the first four methods, even for large tables with a high number of columns. The last method, as it is implemented in the example, is slower, because the ORA_HASH function is quite expensive in CPU time. But if you store the hash value as additional column to the target table, method 5 is very useful for systems with a lot of incremental loads that contain only a small subset of the data. In this case, only the hash values for the new loaded data has to be calculated and compared to the already calculated hash values of the previous loads. If delta detection is done based on full extractions, one of the other methods is faster.

It is interesting to see that a simple problem – the comparison of all columns between two tables – can be implemented in so many different ways in SQL. And of course, this list of different methods is not complete.

Advertisements

5 thoughts on “Delta Detection in Oracle SQL

    • I use this approach quite often for adhoc queries (e.g. when testing ETL processes). It is easy and elegant, especially in combination with a subquery factoring clause (WITH clause). The issue is the performance for large data sets: both tables (source and target) have to be read twice.

      Liked by 4 people

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s