10 Tips to Improve ETL Performance

In summer, the nights are very short. For some Data Warehouses, this is the case all year round, but not because of late sunset and early sunrise: the night is not long enough to finish all the ETL jobs. Load jobs that run for several hours are not a rare situation. Here are some tips on how this can be avoided.

Over the last few days, I did some performance tuning of ETL jobs for a customer project. Several load jobs were running for more than two hours every night; some of them never finished and crashed with the famous error message you probably know:

ORA-01652: unable to extend temp segment by 128 in tablespace TEMP

I had to investigate and improve about a dozen different load jobs. I was surprised that it was possible to solve all the performance problems with only a few patterns. These rules are important when implementing SQL statements for ETL jobs. After adapting the SQL statements, most of the jobs now execute in a few seconds; some of them still need a couple of minutes. Even though such extreme performance improvements are not always possible, it is still worth considering some basic rules for implementing efficient ETL jobs in SQL.

[Image: Sunset/sunrise in Alta, Norway, 12 July 2016, 00:13 — what can we do when the night is too short?]

Loading large amounts of data into a Data Warehouse is a completely different situation than executing queries in an OLTP system. If you load your Data Warehouse with SQL statements in scripts, PL/SQL packages or views, or if you use an ETL tool that is able to execute SQL commands, the following tips may help you to implement fast ETL jobs or to improve the performance of long-running jobs.

1. Use Set-based Operations

The first tip should be obvious to every ETL developer: set-based operations in SQL usually run faster than row-by-row processing in a procedural language. An INSERT/SELECT statement or a CREATE TABLE AS SELECT is a much better solution than a cursor loop in PL/SQL. Even for complex transformations, it is often possible – and recommended – to rewrite a row-based cursor loop into a set-based SQL statement. If you have worked with Oracle Warehouse Builder, you know how important it was to change the mapping configuration to “set-based”.
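As a minimal sketch of the difference (the table names stg_orders and dwh_orders are invented for illustration):

```sql
-- Row-based version (slow): one INSERT per row in a PL/SQL cursor loop
BEGIN
   FOR r IN (SELECT order_id, customer_id, amount FROM stg_orders) LOOP
      INSERT INTO dwh_orders (order_id, customer_id, amount)
      VALUES (r.order_id, r.customer_id, r.amount);
   END LOOP;
   COMMIT;
END;
/

-- Set-based version (fast): one single INSERT/SELECT statement
INSERT INTO dwh_orders (order_id, customer_id, amount)
SELECT order_id, customer_id, amount
FROM   stg_orders;
COMMIT;
```

The set-based version lets the database do all the work in one statement, with no context switches between PL/SQL and SQL.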

ELT tools like Oracle Data Integrator are able to execute SQL statements directly on the database. Many ETL tools provide special features or options to run load jobs in set-based ELT mode. If this is not possible, they often support bulk processing, i.e. multiple rows can be inserted in one step.

All of the following tips are based on the assumption that you (or your ETL tool) are able to run the load jobs with SQL statements.

2. Avoid Nested Loops

A Nested Loops Join is a wonderful and efficient join method for small result sets in an OLTP application. If you want to select a few rows from a big table and join these rows with another table, nested loops in combination with index scans are the fastest way to retrieve the required rows. But ETL jobs usually read many (or all) rows of an input table and join them with other tables. For this kind of query, a Hash Join is definitely a better choice.

When I check the execution plan of an ETL statement, I first look at the join methods. If the execution plan contains any Nested Loops Joins, this is often an indicator of poor performance. The optimizer decides to use this join method if the selectivity is small, i.e. if only a small subset of the data is read. When this is really the case, it is a good decision, but quite often the optimizer is “confused” by unnecessary indexes or complex WHERE conditions (see tips 3 and 4).
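A possible way to check this, sketched with hypothetical tables stg_sales and dim_customer:

```sql
-- Look at the join methods in the execution plan
EXPLAIN PLAN FOR
SELECT s.sale_id, c.customer_name
FROM   stg_sales    s
JOIN   dim_customer c ON c.customer_id = s.customer_id;

SELECT * FROM TABLE(dbms_xplan.display);

-- If NESTED LOOPS appears although (almost) all rows are read,
-- a hint can force a Hash Join while the root cause is investigated:
SELECT /*+ leading(s) use_hash(c) */ s.sale_id, c.customer_name
FROM   stg_sales    s
JOIN   dim_customer c ON c.customer_id = s.customer_id;
```

The hint should be a temporary workaround; fixing the misleading indexes or conditions (tips 3 and 4) is the better long-term solution.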

3. Drop Unnecessary Indexes

Many Data Warehouses are “over-indexed”. Unlike in OLTP systems, indexes are rarely used in a Data Warehouse. But for many database administrators and developers, creating indexes is still their favorite approach to “solve” any performance problem: “The query runs slow, so let’s create another index”. For ETL jobs, this usually doesn’t help; it even increases the load times. Every additional index slows down the DML performance of INSERT, UPDATE or MERGE statements and – even worse – can cause the optimizer to use a Nested Loops Join (see tip 2). ETL jobs work on large data sets, not on a small subset of the data. An index is designed for a selective query that returns only a small percentage of the rows of a table.

My recommendation is to use as few indexes as possible in a Data Warehouse. They are only mandatory for primary keys and unique constraints. In Data Marts, where users run queries, some indexes may be useful for a good query performance. This is typically the case on star schemas where bitmap indexes are created on the fact tables.
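If an index cannot be dropped entirely, one common approach is to make it unusable during the load and rebuild it afterwards. A sketch, with an invented index name (this works only for non-unique indexes that do not enforce a constraint):

```sql
-- Disable index maintenance before the load ...
ALTER INDEX dwh_orders_cust_idx UNUSABLE;
ALTER SESSION SET skip_unusable_indexes = TRUE;

INSERT /*+ append */ INTO dwh_orders
SELECT * FROM stg_orders;
COMMIT;

-- ... and rebuild the index once, after all rows are loaded
ALTER INDEX dwh_orders_cust_idx REBUILD;
```

Rebuilding an index once at the end of the load is much cheaper than maintaining it for every inserted row.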

4. Avoid Functions in WHERE Condition

There can be several reasons for wrong estimations by the query optimizer. A common pitfall is expressions or function calls in WHERE conditions. If you use a SQL function like UPPER, SUBSTR, TO_CHAR, etc. in a filter condition, it is much harder for the optimizer to estimate the cardinality than if you filter on unmodified columns. Of course, it is not always possible to avoid such transformations. But sometimes it is easier and much faster to do the transformations in a previous step (e.g. in a “Cleansing Area” or in an intermediate temporary table). This helps to avoid poor estimations by the optimizer.

The worst case is PL/SQL function calls. They are a “black box” for the optimizer and typically scale down the estimations dramatically. Let’s assume a query on a table with 1 million rows. The query contains three PL/SQL function calls in the WHERE condition, combined with AND operators. The optimizer assumes a selectivity of 1% (0.01) for each function. The estimated cardinality for this query would be 0.01 * 0.01 * 0.01 * 1,000,000, that is 1 row. With such a small cardinality, it is very likely that the optimizer would choose a Nested Loops Join.
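A sketch of the pattern, with an invented staging table stg_customer (assuming the source data has inconsistent case in country_code):

```sql
-- Hard to estimate: the function call hides the column from the optimizer
SELECT customer_id
FROM   stg_customer
WHERE  UPPER(country_code) = 'CH';

-- Better: do the transformation in a previous cleansing step ...
CREATE TABLE cls_customer AS
SELECT customer_id, UPPER(country_code) AS country_code
FROM   stg_customer;

-- ... then filter on the plain, unmodified column
SELECT customer_id
FROM   cls_customer
WHERE  country_code = 'CH';
```

With the plain column, the optimizer can use its column statistics to estimate the cardinality correctly.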

5. Take Care of OR in WHERE Condition

When I see the OR keyword in a WHERE or JOIN condition, I look very carefully at the execution plan. Although the optimizer is able to estimate the selectivity of multiple conditions combined with an OR, it is sometimes “confused” by complex WHERE conditions. In many cases, conditions with OR can be replaced by other expressions. For example, the expression “attr = 4711 OR attr IS NULL” can be replaced by “NVL(attr, 4711) = 4711”.

In some cases, the Oracle optimizer transforms OR conditions into lateral inline views. Although this might be useful under some circumstances, it may cause suboptimal execution plans. Sometimes it is even faster to split an SQL statement with an OR in the WHERE condition into two separate statements: a UNION (or UNION ALL, if the two row sets cannot overlap) of these two statements can run much faster than the original statement with OR.
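The two rewrites from above, sketched with an invented table stg_orders and a status column:

```sql
-- Original: OR condition that may confuse the optimizer
SELECT * FROM stg_orders
WHERE  status = 4711 OR status IS NULL;

-- Rewrite 1: equivalent expression without OR
SELECT * FROM stg_orders
WHERE  NVL(status, 4711) = 4711;

-- Rewrite 2: split into two statements; UNION ALL is safe here
-- because the two conditions cannot match the same row
SELECT * FROM stg_orders WHERE status = 4711
UNION ALL
SELECT * FROM stg_orders WHERE status IS NULL;
```

Which rewrite is faster depends on the data and the available access paths; comparing the execution plans of both variants is the safest way to decide.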

6. Reduce Data as Early as Possible

The earlier the amount of data can be reduced, the less work Oracle has to do to read and join the relevant rows of each table. This is the most important rule for performance tuning in OLTP applications with selective queries. But it may also help to improve the performance of an ETL job. Let’s assume we read one big input table that has to be joined with several smaller tables. In this case, it is faster to join all the small tables first, before the result set is “blown up” with the high number of rows of the big table.

In most cases, the optimizer is able to derive the correct join order based on the cardinality of each table, i.e. the number of estimated rows. With Hash Joins, our preferred join method in ETL jobs, it will always use the smaller row source as the build table of a join. But for SQL statements with complex WHERE conditions, it may happen that the optimizer decides to read the biggest table too early. This issue can be solved with a /*+ leading */ hint or – more elegantly – with a subquery factoring clause, as described in the next section (see tip 7).
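A sketch of the /*+ leading */ hint, with invented tables: two small related tables (dim_region, dim_country) and one big input table (stg_sales):

```sql
-- Force the optimizer to join the two small tables first and to
-- bring in the big table last, using Hash Joins throughout
SELECT /*+ leading(r c s) use_hash(c s) */
       s.sale_id, c.country_name, r.region_name
FROM   dim_region  r
JOIN   dim_country c ON c.region_id  = r.region_id
JOIN   stg_sales   s ON s.country_id = c.country_id;
```

The aliases in the leading hint define the join order from left to right; here the big table stg_sales is joined only after the small tables have been combined.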

7. Use WITH to Split Complex Queries

I’m a fan of the WITH clause, or Subquery Factoring Clause, as it is officially called by Oracle. It allows you to split an extensive SELECT statement into separate parts. This makes the SQL statement easier to read, similar to local procedures in a complex PL/SQL package. But WITH clauses have additional advantages, too. Because each subquery factoring clause can refer to all clauses defined before it, it is very easy to debug the statement and to test the performance of the separate parts.

The optimizer executes a subquery factoring clause either as an inline view or as a temporary table. Especially the optimization as a temporary table is very useful for complex ETL statements: it allows a complex job to be split into separate smaller parts that can be executed much faster. Instead of loading an intermediate table that is used as input for the next ETL job, these steps can be combined in one SQL statement using a WITH clause. And if the optimizer does not realize that the fastest way to run the query is to load a temporary table, we can force this plan with a /*+ materialize */ hint in the subquery factoring clause.
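A sketch of the pattern with invented tables (stg_sales, dim_customer) and filter criteria:

```sql
WITH
-- Step 1: reduce the big input table first; the materialize hint
-- forces the optimizer to store this result in a temporary table
filtered_sales AS (
   SELECT /*+ materialize */ sale_id, customer_id, amount
   FROM   stg_sales
   WHERE  load_date = TRUNC(SYSDATE)
),
-- Step 2: enrich the reduced row set with dimension attributes
enriched_sales AS (
   SELECT f.sale_id, c.customer_name, f.amount
   FROM   filtered_sales f
   JOIN   dim_customer   c ON c.customer_id = f.customer_id
)
SELECT * FROM enriched_sales;
```

Each named subquery can be tested on its own (simply replace the final SELECT), which makes debugging a complex ETL statement much easier.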

8. Run Statements in Parallel

For loads of large data sets, it is highly recommended to run the jobs in parallel. Oracle is able to execute SQL statements with multiple parallel processes. Parallel execution is enabled for queries and DDL commands by default. For DML statements, it must be enabled with an ALTER SESSION ENABLE PARALLEL DML command. Additionally, the parallel degree of all tables must be set to an appropriate value, and the database configuration must be adapted. For further details see blog post Oracle Initialization Parameters for Data Warehouses.

But even if the configuration is suitable for parallel execution, this does not guarantee that every SQL statement runs in parallel. The optimizer decides for each statement – based on the estimated costs – whether it will be executed in parallel or serial mode. You can override this decision with the hints /*+ parallel */ or /*+ no_parallel */.
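A sketch of a parallel direct-path load (table names invented; the degree of 8 is an arbitrary example and must fit your hardware):

```sql
-- Enable parallel DML for this session (required for INSERT/UPDATE/MERGE)
ALTER SESSION ENABLE PARALLEL DML;

-- Parallel, direct-path INSERT/SELECT with an explicit degree of 8
INSERT /*+ append parallel(dwh_sales 8) */ INTO dwh_sales
SELECT /*+ parallel(stg_sales 8) */ *
FROM   stg_sales;

-- A COMMIT must follow immediately; the table cannot be queried
-- in the same transaction after a parallel / direct-path insert
COMMIT;
```

Without the ALTER SESSION command, only the query part of the statement would run in parallel, while the insert itself would still be serial.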

9. Perform Direct-Path INSERT

Direct-Path INSERT is an efficient way to load large data sets into a target table. In contrast to a conventional INSERT statement, this method is much faster because new data is always appended in additional table blocks at the end of the table (above the “high water mark”). This avoids the overhead of reusing free space in existing table blocks and reduces the amount of undo and redo information. Indexes are updated at the end of the statement, not separately for every inserted row. Direct-Path INSERT is enabled explicitly with the /*+ append */ hint or is used implicitly for parallel DML statements (see tip 8). A similar hint is /*+ append_values */, which can be used to enforce Direct-Path INSERT in bulk insert operations, e.g. in a FORALL statement in PL/SQL.

Even for MERGE statements, Direct-Path INSERT can be used for the insert part. Note that in this case, the /*+ append */ hint must be specified after the MERGE keyword, not in the WHEN NOT MATCHED THEN INSERT clause.
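Both variants sketched with invented tables stg_orders and dwh_orders:

```sql
-- Direct-path INSERT: the append hint goes after the INSERT keyword
INSERT /*+ append */ INTO dwh_orders
SELECT * FROM stg_orders;
COMMIT;

-- MERGE with direct-path insert for the "not matched" rows:
-- note that the hint goes after MERGE, not in the INSERT clause
MERGE /*+ append */ INTO dwh_orders t
USING stg_orders s
ON    (t.order_id = s.order_id)
WHEN MATCHED THEN
   UPDATE SET t.amount = s.amount
WHEN NOT MATCHED THEN
   INSERT (order_id, customer_id, amount)
   VALUES (s.order_id, s.customer_id, s.amount);
COMMIT;
```

If the hint is placed in the WHEN NOT MATCHED THEN INSERT clause instead, it is silently ignored and a conventional insert is performed.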

There are several restrictions to be considered. The most important one: if there are enabled foreign key constraints on the target table, Direct-Path INSERT does not work. In this case, you must disable the constraints during the load. This is one of the reasons why reliable constraints are often used in Data Warehouses (see blog post Foreign Key Constraints in an Oracle Data Warehouse).

10. Gather Statistics after Loading each Table

The last step of every ETL job should always be to gather statistics on the target table. This is especially important for tables that are reloaded from scratch (e.g. stage tables or intermediate tables of complex ETL jobs). Missing or wrong statistics can cause poor estimations for the next ETL steps. If an ETL job joins several stage tables that contain outdated statistics, the risk of a poor execution plan is very high.

There are several features in Oracle to handle this issue, e.g. Dynamic Sampling or Adaptive Plans (in Oracle 12c). Nevertheless, it is good advice to have current statistics for each step of a complex ETL run. Make sure that you gather the statistics with the dbms_stats parameter no_invalidate => FALSE (see blog post Avoid dbms_stats.auto_invalidate in ETL jobs). Since Oracle 12c, it is very convenient to gather table statistics on initially loaded tables: if a table is truncated and then loaded with a Direct-Path INSERT, or if it is loaded with CREATE TABLE AS SELECT, statistics are gathered automatically (see blog post Online Statistics Gathering in Oracle 12c).
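A sketch of the statistics call at the end of a load job (the table name is invented):

```sql
BEGIN
   dbms_stats.gather_table_stats(
      ownname       => USER,          -- schema of the target table
      tabname       => 'DWH_ORDERS',  -- table that was just loaded
      no_invalidate => FALSE          -- invalidate dependent cursors
   );                                 -- immediately, so the next ETL
END;                                  -- step uses the fresh statistics
/
```

With the default no_invalidate setting, cached execution plans may survive for a while and the following ETL steps could still run with plans based on the old statistics.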

Conclusion

Of course, these ten performance tips are not a complete reference for performance tuning of ETL jobs. But maybe they help you to reduce the load times of some ETL jobs. The more load jobs you improve by rewriting the SQL statements, the more you will see that the reasons for bad performance are often related to the same basic principles.

Some of the explanations above are very short, probably too short. If you are interested in more details on a particular subject, let me know with a reply to this blog post or via Twitter. This will give me some new ideas for future blog posts. Hopefully, the nights are long enough for this.
