Gathering Statistics in the Autonomous Data Warehouse Cloud

Optimizer statistics are essential for good execution plans and fast performance of the SQL queries. Of course, this is also the case in the Autonomous Data Warehouse Cloud. But the handling of gathering statistics is slightly different from what we know from other Oracle databases.

 

Since a couple of days, I’m testing several features and behaviors of the Autonomous Data Warehouse Cloud (ADWC) to find out, how useful this Cloud platform solution is for real DWH projects and what has to be considered for the development of a Data Warehouse. To simulate a typical scenario, I’m running incremental load jobs into multiple target tables several times per day. The example I use for this is a Data Vault schema for a craft beer brewery (if you want to know more about the data model, watch this video I recorded last year). It the simulated environment on ADWC, I already sold 68 million beers until today – far away from what we sell in our real micro brewery. But this is not the subject I want to write about in this blog post.

Craft beer dv

 

More interesting than the data (which is mostly generated by DBMS_RANDOM) is the fact that no optimizer statistics were gathered so far, although the system is running since more than a week now. I play the role of a “naive ETL developer”, so I don’t care about such technical details. That’s what the Autonomous Data Warehouse should do for me.

Managing Optimizer Statistics in ADWC

For this blog post, I switch my role to the interested developer, that wants to know why there are statistics available. A good starting point – as often – is to read the manual. In the documentation of ADWC, we can find the following statements in the section Managing Optimizer Statistics and Hints on Autonomous Data Warehouse Cloud:

Autonomous Data Warehouse Cloud gathers optimizer statistics automatically for tables loaded with direct-path load operations. … If you have tables modified using conventional DML operations you can run commands to gather optimizer statistics for those tables. …

What does this mean exactly? Let’s look at some more details.

Statistics for ETL Jobs with Conventional DML

The automatic gathering statistics job that is executed regularly on a “normal” Oracle database, does not run on ADWC. The job is enabled, but the maintenance windows of the scheduler are disabled by default:

Auto stats job

This is a good decision, because many data warehouses are running ETL jobs in the time frame of the default windows. Statistics gathering in a data warehouse should always be part of the ETL jobs. This is also the case in ADWC. After loading data into a target table with a conventional DML operation (INSERT, UPDATE, MERGE), the optimizer statistics are gathered with a DBMS_STATS call:


BEGIN
   dbms_stats.gather_table_stats(USER, ‘H_ORDER_ITEM’);
END;
  

 

Only schema and table name must be specified as parameter. For all other settings, the DBMS_STATS preferences are used. Four of them are defined differently per default in Autonomous Data Warehouse Cloud:

  • INCREMENTAL is set to TRUE (default: FALSE). This is only relevant for incremental statistics on partitioned table. Currently, Partitioning is not supported on ADWC, so this preference has no impact.
  • INCREMENTAL_LEVEL is set to TABLE (default: PARTITION). This is relevant for partition exchange in combination with incremental statistics and therefore currently not relevant, too.
  • METHOD_OPT is set to ‘FOR ALL COLUMNS SIZE 254’ (default: … SIZE AUTO). With the default setting, histograms are only gathered if a column was used in a WHERE condition of a SQL query before. In ADWC, a histogram with up to 254 buckets i calculated for each column, independent of the queries that were executed so far. This allows more flexibility for ad-hoc queries and is suitable in a data warehouse environment.
  • NO_INVALIDATE is set to FALSE (default: DBMS_STATS.AUTO_INVALIDATE). For ETL jobs, it is important so set this parameter to FALSE (see my previous blog post Avoid dbms_stats.auto_invalidate in ETL jobs). So, the preference setting is a very good choice for data warehouses.

The configuration of ADWC makes it very easy to gather optimizer statistics in your ETL jobs, but you still have to make sure that a DBMS_STATS call is included at the end of each ETL job.

Statistics for ETL Jobs with Direct-Path Loads

A better approach is to use Direct-Path INSERT statements. This is not only faster for large data sets, but makes it much easier to manage optimizer statistics. The reason is an Oracle 12c feature and two new undocumented parameters.

Since Oracle 12.1, statistics are gathered automatically for a Direct-Path INSERT. This works only for empty tables, and no histograms are calculated, as explained in my previous blog post Online Statistics Gathering in Oracle 12c.

In ADWC, two new undocumented parameters are available, both are set to TRUE by default:

  • _optimizer_gather_stats_on_load_all”: When this parameter is TRUE, online statistics are gathered even for a Direct-Path operation into a non-empty target table.
  • _optimizer_gather_stats_on_load_hist”: When this parameter is TRUE, histograms are calculated during online statistics gathering.

The following code fragment shows this behavior: Before an incremental load into the Hub table H_ORDER_ITEM, the number of rows in the table statistics is 68386107. After inserting another 299041 rows, the table statistics are increased to 68685148 (= 68386107 + 299041).

 
SELECT table_name, num_rows, last_analyzed
 FROM user_tab_statistics

 WHERE table_name = ‘H_ORDER_ITEM’;
 
TABLE_NAME             NUM_ROWS LAST_ANALYZED
——————– ———- ——————-
H_ORDER_ITEM          
68386107 11.07.2018 09:37:04
 
 
INSERT /*+ append */ INTO h_order_item
     ( h_order_item_key
     , order_no
     , line_no
     , load_date
     , record_source
     )
SELECT s.h_order_item_key
     , s.order_no
     , s.line_no
     , v_load_date
     , c_record_source
  FROM v_stg_order_details s
  LEFT OUTER JOIN h_order_item t
    ON (s.h_order_item_key = t.h_order_item_key)
 WHERE t.h_order_item_key IS NULL;
 
299041 rows inserted.
 
COMMIT;
 
SELECT table_name, num_rows, last_analyzed
  FROM user_tab_statistics
 WHERE table_name = ‘H_ORDER_ITEM’;
 
TABLE_NAME             NUM_ROWS LAST_ANALYZED
——————– ———- ——————-
H_ORDER_ITEM           68685148 11.07.2018 14:11:09
~                                                                                                                                  

 

The column statistics (including histograms) are adapted for the target table, too. Only index statistics are not affected during online statistics gathering – but indexes in ADWC are a different story anyway. I will write about it in a separate blog post.

Conclusion

Statistics gathering is still important in the Autonomous Data Warehouse Cloud, and we have to take care that the optimizer statistics are frequently been updated. For Direct-Path operations, this works automatically, so we have nothing to do anymore. Only for conventional DML operations, it is still required to call DBMS_STATS after each ETL job, but the default configuration of ADWC makes it very easy to use.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s