Hash keys can be used instead of sequence numbers to build surrogate keys in a Data Warehouse. This is often done in Data Vault environments, and there are some good reasons to do that. But there are also several issues and restrictions with hash keys. So, use hash keys only if you have good reasons for them. And if you decide to use hash keys in your Oracle Data Warehouse, make sure that you do it the right way.

A hash function is a deterministic function that distributes a (usually high) number of input values to a set of distinct hash values. Hash algorithms are used for different purposes in computer science (e.g. compilers and parsers, in cryptography or for key generators). In the Oracle database, hash functions are used for hash joins, hash partitioning and hash clusters – to name just a few examples.

To illustrate the principle of a hash function, I use the expression MOD(n, 7) as a hash function in the following example. The function is deterministic, i.e. for the same input value, the result is always the same. The result returns the remainder of a division by 7, therefore always a number between 0 and 6.

*Basic principle of a hash function*

As you can see in the example above, the hash values are not unique. For the input values 15 and 22, the hash key (1) is the same. That is called a hash collision. For many situations, this is not an issue. If a hash algorithm is used for equal distribution of rows to multiple hash partitions, it is clear that more than one row is stored in each partition. But for a hash key generator, hash collisions have to be avoided.

In this blog post, I will write about a special use case of hash functions: Building hash keys in a Data Warehouse to replace the business key with a surrogate primary key. In this case, the business key (one or more attributes of a source table) is the input value. The result of the hash function is a hash key. Hash keys are often used in Data Warehouses with Data Vault models, they are even one of the main concepts of Data Vault 2.0. But hash keys can also be used in dimensional models as primary keys of the dimension tables, in historized head & version models, or in any 3NF data model.

# Sequence Numbers or Hash Keys?

There are many discussions about the right approach to build surrogate keys. Should I use sequence numbers or hash keys? Or even UUIDs? As usual in real life, there are advantages and disadvantages for each of the concepts. Unfortunately, there is an almost religious “key discussion” in the Data Vault community. What is my opinion? I don’t want to be a “good guy” or a “bad guy”, depending on whether I use sequence numbers or hash keys. For me as a Data Warehouse consultant, it is important to understand the principle of the different concepts, their advantages and disadvantages. Then I can decide for each project which approach fits best for the specific requirements and technology. Here just a short summary of the two approaches:

**Sequence Numbers**: For every new business key in a Hub, a new sequence number is created. In Oracle, this can be implemented classically with a sequence, or in Oracle 12c with an identity column. To determine the foreign keys on Links or Satellites, a key lookup on the Hub is needed. For each business key, the corresponding surrogate key is derived from the Hub. Key lookup operations are very common in Data Warehouses and are available in every ETL tool. In SQL, a key lookup is usually just a (left outer) join on the lookup table (here on the Hub) to determine the correct sequence number.

*Sequence numbers with key lookup*

The main issue of this approach is the load dependency: All Hubs must be loaded before the Links and Satellites to guarantee that all key lookups return a result. Additionally, the sequence numbers in different environments (e.g. development, test and production database) are not in sync. If data has to be compared or moved from one environment to another, the sequence numbers must be recalculated. The third challenge is the reload of data from scratch: If – for any reason – a table is truncated and initially loaded, each row will get a new sequence number, so the dependent tables must be reloaded as well.

**Hash Keys**: If a hash algorithm is used to determine the surrogate keys, the hash key can be derived from the business key directly, without a key lookup. Because a hash function is deterministic, it will always return the same hash key for the same business key: On the Hub as well as on the referring Links and Satellites. Therefore, there is no special load order required, all tables can (theoretically) be loaded in parallel. And because of the deterministic function, the hash keys would be the same in multiple environments and after a reload of the same data.

*Hash keys calculated with hash function*

The simultaneous load of all Hubs, Links and Satellites is not always possible. If you have foreign key constraints defined on your Data Vault tables, you still have to load the Hubs before the Links and Satellites. One option would be to omit the constraints (which I definitely not recommend). Reliable constraints can help here (see my blog post Foreign Key Constraints in an Oracle Data Warehouse). But if you don’t rely on the source data and want to prove data quality with constraints, you still need a predefined load order, even with hash keys.

Another issue of hash keys is the risk of hash collisions. Hence it is important to choose a suitable hash function. The probability of a hash collision with MD5 or SHA-1 algorithm (see below) is very small – but theoretically still exists. So, what happens if two business keys would result in the same hash key? In the best case, the load job fails with a primary key violation. In the worst case, data of two different business entities is mixed, e.g. sales orders of one customer are assigned to another customer. Sounds horrible, but in many Data Warehouses, this is not a real issue. The risk of wrong results because of bad source data quality or development bugs in the ETL jobs is higher than the chance of a hash collision.

The decision whether to use sequence numbers or hash keys depends on many aspects. If you have good reasons to use hash keys (or sequence numbers), then use them. But don’t do it just because “it is said to be used”.

# The Wrong Way: ORA_HASH

A well-known hash function in Oracle is the SQL function ORA_HASH. This is a useful function to distribute data to multiple subsets, but the generated hash keys are far away from uniqueness. Many people are impressed by the maximum number of buckets (i.e. the number of possible return values) of this hash function. The default (and highest possible value) of the optional parameter MAX_BUCKET is 4294967295. That means, the hash function returns 32 bits – not very much. So, hash collisions will occur frequently, even with small data sets (see Connor McDonald’s blog post). The probability of two input values with the same hash key is quite high. With 9300 input rows, the probability is 1%, with 50’000 rows already 25%, and with 77’000 rows 50%.

The mathematical background for this behavior is the so-called **Birthday Problem**: What is the probability that two people have their birthday on the same date of the year, depending on the total number of people? Did you know, that with only 23 people, this probability is already 50%? A good explanation of the birthday problem can be found on Wikipedia. If you don’t like mathematics and prefer more practical exercises, run the following query on an Oracle database:

WITH input_data AS

(SELECT ROWNUM n, ORA_HASH(ROWNUM) hash

FROM dual CONNECT BY ROWNUM <= 1000000)

SELECT hash, COUNT(*)

FROM input_data

GROUP BY hash HAVING COUNT(*) > 1;

The result is 122 hash collisions for an input of 1 million rows. With 100’000 rows, there are already 3 collisions. Definitely not a good choice for a Data Warehouse! So, we need a better hash function.

# Cryptographic Hash Functions in Oracle

Several common cryptographic hash algorithms are available that are suitable to generate (almost) unique hash keys with a very small probability of hash collisions. The most famous ones are MD5 (message-digest algorithm) as well as SHA-1 and SHA-2 (secure hash algorithm). SHA-2 consists of multiple variants with a different number of output bits. The higher the number of output bits, the smaller is the probability of a hash collision.

The Oracle database contains different implementations of cryptographic hash functions, but not all of them are available in every version. Here an overview of the different packages and functions.

DBMS_OBFUSCATION_TOOLKIT: This PL/SQL package contains some overloaded procedures and functions to generate MD5 hash keys. The package is deprecated since Oracle 11.2 and should not be used anymore. Because of the overloaded versions of the MD5 function, it is not possible to use the function directly in SQL. Instead, a PL/SQL function must be written to be called in SQL.

CREATE OR REPLACE

FUNCTION hash_key (v_input VARCHAR2)

RETURN dbms_obfuscation_toolkit.varchar2_checksum

AS

BEGIN

RETURN dbms_obfuscation_toolkit.md5(input_string => v_input);

END hash_key;

HASH_MD4 CONSTANT PLS_INTEGER := 1;

HASH_MD5 CONSTANT PLS_INTEGER := 2;

HASH_SH1 CONSTANT PLS_INTEGER := 3;

HASH_SH256 CONSTANT PLS_INTEGER := 4;

HASH_SH384 CONSTANT PLS_INTEGER := 5;

HASH_SH512 CONSTANT PLS_INTEGER := 6;

Unfortunately, the constants cannot be used directly in SQL. Instead, the corresponding numbers must be added as parameters, and an explicit type conversion to RAW, CLOB or BLOB must be included in the statement. So, it is still easier to define a wrapper function in PL/SQL:

CREATE OR REPLACE

FUNCTION hash_key (v_input VARCHAR2) RETURN RAW DETERMINISTIC

AS

PRAGMA UDF;

BEGIN

RETURN dbms_crypto.hash(utl_raw.cast_to_raw(v_input), dbms_crypto.HASH_MD5);

END hash_key;

STANDARD_HASH: This new SQL function was introduced with Oracle 12.1. It supports the same hash algorithms as DBMS_CRYPTO, but is easier to use in SQL – and much faster. The first parameter is an input value of any data type except LONG and LOB. The second (optional) parameter is a string that defines the hash algorithm. Default is ‘SHA1’, other possible values are ‘MD5’, ‘SHA256’, ‘SHA384’ and ‘SHA512’. The function can be used directly in SQL:

WITH input_data AS

(SELECT ROWNUM n, STANDARD_HASH(ROWNUM, ‘MD5') hash

FROM dual CONNECT BY ROWNUM <= 1000000)

SELECT hash, COUNT(*)

FROM input_data

GROUP BY hash HAVING COUNT(*) > 1;

The statement generates 1 million hash keys without any hash collisions. This is also the case for all functions described above (except ORA_HASH). The different implementations of the Oracle cryptographic hash functions are all suitable to generate unique hash keys, but the performance is different. The query above was used to run all of the available hash algorithms of the PL/SQL packages and the SQL function. The following table compares the execution times.

*SQL runtime to generate 1 million has keys*

As you can see from the execution times, the SQL function STANDARD_HASH is much faster than DBMS_CRYPTO. So, if you decided to use hash keys in your Data Warehouse and you work with Oracle 12c, the best option is to use the function STANDARD_HASH. The default hash algorithm SHA-1 is the best compromise: 160 bits are enough to avoid hash collisions, but the performance is better than with the SHA-2 algorithms. DBMS_CRYPTO should only be used for backward compatibility, and DBMS_OBFUSCATION_TOOLKIT should not be used anymore at all.