Recently I realized that the dbms_stats parameter no_invalidate is important for statistics gathering in ETL jobs. I never considered this parameter because “the default value will be good enough” – as for most other parameters of the statistics gathering procedures. But in this case, this is obviously not true.
In ETL jobs it is recommended to gather the optimizer statistics after each data load. In many cases this task is performed as the very last step of an ETL run. A more advanced approach is to gather the statistics after each step. For example, if a stage table is loaded, it is useful to gather the statistics for this table before further ETL processing is executed. This helps the optimizer to find the best execution plan for the successive steps of the ETL job. As long as you just load data from one table to another without any transformations, filters or lookups, this is not crucial. But when an ETL step joins several tables, filters some of the rows or evaluates key/code values from multiple lookup tables, the join order and join methods can be quite important for the performance of the ETL job.
Example (part 1)
Let’s assume, we load sales transactions of a grocery store into a Data Warehouse every two hours. The stage table STG_SALES is truncated and loaded with all transaction records since the last load. After loading the stage table, the following procedure call is used to gather table statistics:
(ownname => 'DWH_STAGE',
,tabname => 'STG_SALES'
,estimate_percent => dbms_stats.auto_sample_size
,no_invalidate => dbms_stats.auto_invalidate); -- default value
The next ETL job contains a dozen key lookups for each transaction record and has therefore to join the stage table STG_SALES with 12 lookup tables (or dimension tables).
Important in this example is the parameter no_invalidate which is usually omitted. It decides whether dependent cursors in the shared pool are invalidated or not when new statistics are gathered. The following values are supported:
- FALSE: After gathering statistics on the table, all dependent cursors are invalidated immediately. In other words: For each SQL statement that accesses the table, the optimizer must calculate a new execution plan. This was the default behavior before Oracle 10g.
- TRUE: In this case, all dependent cursors stay valid as long as the are cached in the shared pool. A new execution plan is calculated only for new SQL statements (or statements that were not executed for a long time and are not in the shared pool anymore).
- AUTO_INVALIDATE: The purpose of this setting is that not all cursors are invalidated at the same time (which would cause a lot of re-parses after the statistics jobs), but within a time period of 5 hours. This is the default since Oracle 10g.
(Note: The time delay can be changed with the undocumented parameter _optimizer_invalidation_period. The default value is 18000 seconds)
The default value AUTO_INVALIDATE may be useful on a multi-user OLTP database. It helps to avoid a lot of parse time directly after running the gathering jobs. But for an ETL job as described above, this behavior is not what we want.
Example (part 2)
Let’s come back to our previous example:
On Saturday morning (8 o’clock), only very few sales transactions are loaded into STG_SALES (the grocery store just opened). Only two rows are loaded into the stage table. The successive ETL job joins these two rows with all the lookup tables. The best execution plan for this is to start with table STG_SALES and to join each of the lookup tables with a nested loops join. Everything is fine, and the ETL job finishes in a few seconds.
During the next two hours, many customers come to the store and buy a lot of food for the weekend. The next ETL job starts at 10 o’clock and loads 30000 sales transactions into the table STG_SALES. After this step, the statistics on the stage table are gathered, and the successive ETL job is started. What happens?
If we use the default setting for no_invalidate, Oracle would use the AUTO_INVALIDATE algorithm. In this case, it would use the execution plan of the previous load – but 12 nested loops joins on a driving table with 30000 rows are probably not a good idea. The ETL job will run much too long.
To avoid this, we should explicitly define no_invalidate => FALSE as in the example above. In this case, a new (and better) execution plan is calculated for each of the successive ETL jobs. In our case, the optimizer would decide to use hash joins for the lookup tables. The result is a much better performance of the ETL jobs.
My recommendation for gathering statistics in an ETL job is to always set the parameter no_invalidate to FALSE:
(ownname => 'DWH_STAGE'
,tabname => 'STG_SALES'
,estimate_percent => dbms_stats.auto_sample_size);
,no_invalidate => FALSE);
Alternatively, you can redefine the global statistics preferences with the following procedure call:
(pname => 'NO_INVALIDATE'
,pvalue => 'FALSE');
Ok, you could argue that the example is not what you have in your DWH environments. If you load data only once a day, the situation described here never happens. That’s what I though until few months ago. But then I saw exactly this situation in two Data Warehouses of different customers. On one system, it happened after a reload on the same day because of a wrong data delivery. In the other customer project it was because data is delivered in multiple files that are loaded sequentially into the DWH database. In both cases, the auto_invalidate algorithm caused a long-running ETL job.